Abstract

We demonstrate an equivalence between repro- 
ducing kernel Hilbert space (RKHS) embeddings 
of conditional distributions and vector-valued re- 
gressors. This connection introduces a natural 
regularized loss function which the RKHS em- 
beddings minimise, providing an intuitive under- 
standing of the embeddings and a justification for 
their use. Furthermore, the equivalence allows 
the application of vector-valued regression meth- 
ods and results to the problem of learning con- 
ditional distributions. Using this link we derive 
a sparse version of the embedding by consider- 
ing alternative formulations. Further, by apply- 
ing convergence results for vector-valued regres- 
sion to the embedding problem we derive minimax
convergence rates which are $O(\log(n)/n)$, compared to
current state-of-the-art rates of $O(n^{-1/4})$, and
are valid under milder and more
intuitive assumptions. These minimax upper 
rates coincide with lower rates up to a logarith- 
mic factor, showing that the embedding method 
achieves nearly optimal rates. We study our 
sparse embedding algorithm in a reinforcement 
learning task where the algorithm shows signifi- 
cant improvement in sparsity over an incomplete 
Cholesky decomposition. 



1. Introduction/Motivation 

In recent years a framework for embedding probability dis- 
tributions into reproducing kernel Hilbert spaces (RKHS) 
has become increasingly popular (Smola et al., 2007). One
example of this theme has been the representation of condi- 
tional expectation operators as RKHS functions, known as 
conditional mean embeddings (Song et al., 2009). Conditional
expectations appear naturally in many machine
learning tasks, and the RKHS representation of such ex- 
pectations has two important advantages: first, conditional 
mean embeddings do not require solving difficult interme- 
diate problems such as density estimation and numerical 
integration; and second, these embeddings may be used 
to compute conditional expectations directly on the basis 
of observed samples. Conditional mean embeddings have 
been successfully applied to inference in graphical models, 
reinforcement learning, subspace selection, and conditional 
independence testing (Fukumizu et al., 2008; 2009; Song
et al., 2009; 2010; Grünewälder et al., 2012).

The main motivation for conditional means in Hilbert
spaces has been to generalize the notion of conditional ex- 
pectations from finite cases (multivariate Gaussians, con- 
ditional probability tables, and so on). Results have been 
established for the convergence of these embeddings in 
RKHS norm (Song et al., 2009; 2010), which show that
conditional mean embeddings behave in the way we would 
hope (i.e., they may be used in obtaining conditional ex- 
pectations as inner products in feature space, and these es- 
timates are consistent under smoothness conditions). De- 
spite these valuable results, the characterization of condi- 
tional mean embeddings remains incomplete, since these 
embeddings have not been defined in terms of the opti- 
mizer of a given loss function. This makes it difficult to 
extend these results, and has hindered the use of standard 
techniques like cross-validation for parameter estimation. 

In this paper, we demonstrate that the conditional mean em- 
bedding is the solution of a vector-valued regression prob- 
lem with a natural loss, resembling the standard Tikhonov 
regularized least-squares problem in multiple dimensions. 
Through this link, it is possible to access the rich theory
of vector-valued regression (Micchelli & Pontil, 2005;
Carmeli et al., 2006; Caponnetto & De Vito, 2007; Caponnetto
et al., 2008). We demonstrate the utility of this connection
by providing novel characterizations of conditional
mean embeddings, with important theoretical and practical 
implications. On the theoretical side, we establish novel 
convergence results for RKHS embeddings, giving a significant
improvement over the rate of $O(n^{-1/4})$ due to Song
et al. (2009; 2010). We derive a faster $O(\log(n)/n)$ rate
which holds over large classes of probability measures, and 
requires milder and more intuitive assumptions. We also 
show our rates are optimal up to a log(n) term, following 
the analysis of Caponnetto & De Vito (2007). On the practical
side, we derive an alternative sparse version of the embeddings
which resembles the Lasso method, and provide
a cross-validation scheme for parameter selection. 

2. Background 

In this section, we recall some background results concern- 
ing RKHS embeddings and vector-valued RKHS. For an 
introduction to scalar-valued RKHS we refer the reader to
(Berlinet & Thomas-Agnan, 2004).



2.1. Conditional mean embeddings 

Given sets $\mathcal{X}$ and $\mathcal{Y}$, with a distribution $P$ over random variables $(X, Y)$ from $\mathcal{X} \times \mathcal{Y}$, we consider the problem of learning expectation operators corresponding to the conditional distributions $P(Y|X = x)$ on $\mathcal{Y}$ after conditioning on $x \in \mathcal{X}$. Specifically, we begin with a kernel $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, with corresponding RKHS $\mathcal{H}_L \subseteq \mathbb{R}^{\mathcal{Y}}$, and study the problem of learning, for every $x \in \mathcal{X}$, the conditional expectation mapping $\mathcal{H}_L \ni h \mapsto E[h(Y)|X = x]$. Each such map can be represented as

$$E[h(Y)|X = x] = \langle h, \mu(x) \rangle_L,$$

where the element $\mu(x) \in \mathcal{H}_L$ is called the (conditional) mean embedding of $P(Y|X = x)$. Note that, for every $x$, $\mu(x)$ is a function on $\mathcal{Y}$. It is thus apparent that $\mu$ is a mapping from $\mathcal{X}$ to $\mathcal{H}_L$, a point which we will expand upon shortly.

We are interested in the problem of estimating the embeddings $\mu(x)$ given an i.i.d. sample $\{(x_i, y_i)\}_{i=1}^n$ drawn from $P^n$. Following (Song et al., 2009; 2010), we define a second kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with associated RKHS $\mathcal{H}_K$, and consider the estimate

$$\hat\mu(x) := \sum_{i=1}^n \alpha_i(x) L(y_i, \cdot), \qquad (1)$$

where $\alpha_i(x) = \sum_{j=1}^n W_{ij} K(x_j, x)$, and where $W := (\mathbf{K} + \lambda n I)^{-1}$, $\mathbf{K} = (K(x_i, x_j))_{i,j=1}^n$, and $\lambda$ is a chosen regularization parameter. This expression suggests that the conditional mean embedding is the solution to an underlying regression problem: we will formalize this link in Section 3. In the remainder of the present section, we introduce the necessary terminology and theory for vector-valued regression in RKHSs.
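
To make estimate (1) concrete, the following is a minimal NumPy sketch of how the weights $\alpha(x) = W k(x)$ might be computed and then used to estimate $E[h(Y)|X = x] = \sum_i \alpha_i(x) h(y_i)$, which follows from the reproducing property in $\mathcal{H}_L$. The Gaussian kernel, the bandwidth, and the function names `gaussian_kernel`, `embedding_weights` and `conditional_expectation` are illustrative assumptions, not part of the original method description.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between rows of A and rows of B (an assumed kernel choice)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def embedding_weights(X_train, X_query, lam, bandwidth=1.0):
    """Return alpha(x) = W k(x) with W = (K + lam * n * I)^{-1}, following estimate (1).

    The result has shape (n, q): column t holds the weights for query point t.
    """
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, bandwidth)
    k_query = gaussian_kernel(X_train, X_query, bandwidth)
    return np.linalg.solve(K + lam * n * np.eye(n), k_query)


def conditional_expectation(h_values, alpha):
    """Estimate E[h(Y)|X=x] as sum_i alpha_i(x) h(y_i), where h_values[i] = h(y_i)."""
    return h_values @ alpha
```

With a very small regularizer and well-separated training inputs, evaluating the estimate at the training inputs approximately reproduces the observed values $h(y_i)$, which is a quick sanity check on the weights.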

2.2. Vector-valued regression and RKHSs

We recall some background on learning vector-valued functions using kernel methods (see Micchelli & Pontil, 2005 for more detail). We are given a sample $\{(x_i, v_i)\}_{i \le m}$ drawn i.i.d. from some distribution over $\mathcal{X} \times \mathcal{V}$, where $\mathcal{X}$ is a non-empty set and $(\mathcal{V}, \langle \cdot, \cdot \rangle_{\mathcal{V}})$ is a Hilbert space. Our goal is to find a function $f : \mathcal{X} \to \mathcal{V}$ with low error, as measured by

$$E_{(X,V)}\left[\|f(X) - V\|_{\mathcal{V}}^2\right]. \qquad (2)$$

This is the vector-valued regression problem (square loss).

One approach to the vector-valued regression problem is to 
model the regression function as being in a vector-valued 
RKHS of functions taking values in V, which can be de- 
fined by analogy with the scalar valued case. 

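Definition. A Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_\Gamma)$ of functions $h : \mathcal{X} \to \mathcal{V}$ is an RKHS if for all $x \in \mathcal{X}$, $v \in \mathcal{V}$ the linear functional $h \mapsto \langle v, h(x) \rangle_{\mathcal{V}}$ is continuous.

The reproducing property for vector-valued RKHSs follows from this definition (see Micchelli & Pontil, 2005, Sect. 2). By the Riesz representation theorem, for each $x \in \mathcal{X}$ and $v \in \mathcal{V}$, there exists a linear operator from $\mathcal{V}$ to $\mathcal{H}_\Gamma$ written $\Gamma_x v \in \mathcal{H}_\Gamma$, such that for all $h \in \mathcal{H}_\Gamma$,

$$\langle v, h(x) \rangle_{\mathcal{V}} = \langle h, \Gamma_x v \rangle_\Gamma.$$

It is instructive to compare to the scalar-valued RKHS $\mathcal{H}_K$, for which the linear operator of evaluation $\delta_x$ mapping $h \in \mathcal{H}_K$ to $h(x) \in \mathbb{R}$ is continuous: then Riesz implies there exists a $K_x$ such that $y h(x) = \langle h, y K_x \rangle_K$.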

We next introduce the vector-valued reproducing kernel, and show its relation to $\Gamma_x$. Writing as $\mathcal{L}(\mathcal{V})$ the space of bounded linear operators from $\mathcal{V}$ to $\mathcal{V}$, the reproducing kernel $\Gamma(x, x') \in \mathcal{L}(\mathcal{V})$ is defined as

$$\Gamma(x, x') v = (\Gamma_{x'} v)(x) \in \mathcal{V}.$$

From this definition and the reproducing property, the following holds (Micchelli & Pontil, 2005, Prop. 2.1).

Proposition 2.1. A function $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{V})$ is a kernel if it satisfies: (i) $\Gamma(x, x') = \Gamma(x', x)^*$, (ii) for all $n \in \mathbb{N}$, $\{(x_i, v_i)\}_{i \le n} \subseteq \mathcal{X} \times \mathcal{V}$ we have that $\sum_{i,j \le n} \langle v_i, \Gamma(x_i, x_j) v_j \rangle_{\mathcal{V}} \ge 0$.

It is again helpful to consider the scalar case: here, $\langle K_x, K_{x'} \rangle_K = K(x, x')$, and to every positive definite kernel $K(x, x')$ there corresponds a unique (up to isometry) RKHS for which $K$ is the reproducing kernel. Similarly, if $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{V})$ is a kernel in the sense of Proposition 2.1 there exists a unique (up to isometry) RKHS, with $\Gamma$ as its reproducing kernel (Micchelli & Pontil, 2005, Th. 2.1). Furthermore, the RKHS $\mathcal{H}_\Gamma$ can be described as the RKHS limit of finite sums; that is, $\mathcal{H}_\Gamma$ is up to isometry equal to the closure of the linear span of the set $\{\Gamma_x v : x \in \mathcal{X}, v \in \mathcal{V}\}$, with the RKHS norm $\|\cdot\|_\Gamma$.

Importantly, it is possible to perform regression in this setting. One approach to the vector-valued regression problem is to replace the unknown true error (2) with a sample-based estimate $\sum_{i=1}^n \|v_i - f(x_i)\|_{\mathcal{V}}^2$, restricting $f$ to be an element of an RKHS $\mathcal{H}_\Gamma$ (of vector-valued functions), and regularizing w.r.t. the $\mathcal{H}_\Gamma$ norm, to prevent overfitting. We thus arrive at the following regularized empirical risk,

$$\hat{\mathcal{E}}_\lambda(f) := \sum_{i=1}^n \|v_i - f(x_i)\|_{\mathcal{V}}^2 + \lambda \|f\|_\Gamma^2. \qquad (3)$$

Theorem 2.2. (Micchelli & Pontil, 2005, Th. 4) If $f^*$ minimises $\hat{\mathcal{E}}_\lambda$ in $\mathcal{H}_\Gamma$ then it is unique and has the form,

$$f^* = \sum_{i=1}^n \Gamma_{x_i} c_i,$$

where the coefficients $\{c_i\}_{i \le n}$, $c_i \in \mathcal{V}$ are the unique solution of the system of linear equations

$$\sum_{i \le n} (\Gamma(x_j, x_i) + \lambda \delta_{ji}) c_i = v_j, \quad 1 \le j \le n.$$
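
As an illustration of Theorem 2.2, the sketch below solves the linear system in the special case $\Gamma(x, x') = K(x, x')\,\mathrm{Id}$ with $\mathcal{V} = \mathbb{R}^d$, where the system decouples into the matrix equation $(\mathbf{K} + \lambda I) C = V$. The restriction to this separable kernel and to a finite-dimensional $\mathcal{V}$, as well as the function names, are assumptions made for the example only.

```python
import numpy as np


def fit_coefficients(K, V, lam):
    """Solve sum_i (K[j, i] + lam * delta_ji) c_i = v_j for every j.

    K: (n, n) scalar kernel matrix; V: (n, d) targets, one row per v_j.
    Returns C of shape (n, d) whose rows are the coefficients c_i.
    """
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), V)


def predict(K_query, C):
    """Evaluate f*(x) = sum_i K(x, x_i) c_i, with K_query of shape (q, n)."""
    return K_query @ C


def regularized_risk(K, V, C, lam):
    """Empirical risk (3) for f = sum_i Gamma_{x_i} c_i.

    For Gamma = K * Id the squared RKHS norm of f equals trace(C^T K C).
    """
    residual = V - K @ C
    return np.sum(residual**2) + lam * np.trace(C.T @ K @ C)
```

Perturbing the returned coefficients in any direction should not decrease `regularized_risk`, which gives a simple numerical check of the minimiser property.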



In the scalar case we have that $|f(x)| \le \sqrt{K(x, x)}\, \|f\|_K$. Similarly it holds that $\|f(x)\|_{\mathcal{V}} \le |||\Gamma(x, x)|||\, \|f\|_\Gamma$, where $|||\cdot|||$ denotes the operator norm (Micchelli & Pontil, 2005, Prop. 1). Hence, if $|||\Gamma(x, x)||| \le B$ for all $x$ then

$$\|f(x)\|_{\mathcal{V}} \le B \|f\|_\Gamma. \qquad (4)$$

Finally, we need a result that tells us when all functions in an RKHS are continuous. In the scalar case this is guaranteed if $K(x, \cdot)$ is continuous for all $x$ and $K$ is bounded. In our case we have (Carmeli et al., 2006)[Prop. 12]:

Corollary 2.3. If $\mathcal{X}$ is a Polish space, $\mathcal{V}$ a separable Hilbert space and the mapping $x \mapsto \Gamma(\cdot, x)$ is continuous, then $\mathcal{H}_\Gamma$ is a subset of the set of continuous functions from $\mathcal{X}$ to $\mathcal{V}$.

3. Estimating conditional expectations 

In this section, we show the problem of learning condi- 
tional mean embeddings can be naturally formalised in the 
framework of vector-valued regression, and in doing so we 
derive an equivalence between the conditional mean em- 
beddings and a vector-valued regressor. 



3.1. The equivalence between conditional mean 
embeddings and a vector-valued regressor

Conditional expectations $E[h(Y)|X = x]$ are linear in the argument $h$ so that, when we consider $h \in \mathcal{H}_L$, the Riesz representation theorem implies the existence of an element $\mu(x) \in \mathcal{H}_L$ such that $E[h(Y)|X = x] = \langle h, \mu(x) \rangle_L$ for all $h$. That being said, the dependence of $\mu$ on $x$ may be complicated. A natural optimisation problem associated to this approximation problem is therefore to find a function $\mu : \mathcal{X} \to \mathcal{H}_L$ such that the following objective is small

$$\mathcal{E}[\mu] := \sup_{\|h\|_L \le 1} E_X\left[\left(E_Y[h(Y)|X] - \langle h, \mu(X) \rangle_L\right)^2\right]. \qquad (5)$$



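Note that the risk function cannot be used directly for estimation, because we do not observe $E_Y[h(Y)|X]$, but rather pairs $(X, Y)$ drawn from $P$. However, we can bound this risk function with a surrogate risk function that has a sample-based version,

$$\begin{aligned}
\sup_{\|h\|_L \le 1} E_X\left[\left(E_Y[h(Y)|X] - \langle h, \mu(X) \rangle_L\right)^2\right]
&= \sup_{\|h\|_L \le 1} E_X\left[\left(E_Y[\langle h, L(Y, \cdot) \rangle_L | X] - \langle h, \mu(X) \rangle_L\right)^2\right] \\
&\le \sup_{\|h\|_L \le 1} E_{X,Y}\left[\langle h, L(Y, \cdot) - \mu(X) \rangle_L^2\right] \\
&\le \sup_{\|h\|_L \le 1} \|h\|_L^2\, E_{X,Y}\left[\|L(Y, \cdot) - \mu(X)\|_L^2\right] \\
&= E_{(X,Y)}\left[\|L(Y, \cdot) - \mu(X)\|_L^2\right], \qquad (6)
\end{aligned}$$

where the first and second bounds follow by Jensen's and Cauchy-Schwarz's inequalities, respectively. Let us denote this surrogate risk function as

$$\mathcal{E}_s[\mu] := E_{(X,Y)}\left[\|L(Y, \cdot) - \mu(X)\|_L^2\right]. \qquad (7)$$

The two risk functions $\mathcal{E}$ and $\mathcal{E}_s$ are closely related and in Section 3.3 we examine their relation.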



We now replace the expectation in (6) with an empirical estimate, to obtain the sample-based loss,

$$\hat{\mathcal{E}}_n[\mu] := \sum_{i=1}^n \|L(y_i, \cdot) - \mu(x_i)\|_L^2. \qquad (8)$$

Taking (8) as our empirical loss, then following Section 2.2 we add a regularization term to provide a well-posed problem and prevent overfitting,

$$\hat{\mathcal{E}}_{\lambda,n}[\mu] := \sum_{i=1}^n \|L(y_i, \cdot) - \mu(x_i)\|_L^2 + \lambda \|\mu\|_\Gamma^2. \qquad (9)$$

We denote the minimizer of (9) by $\hat\mu_{\lambda,n}$,

$$\hat\mu_{\lambda,n} := \operatorname*{argmin}_{\mu} \left\{ \hat{\mathcal{E}}_{\lambda,n}[\mu] \right\}. \qquad (10)$$

Thus, recalling (3), we can see that the problem (10) is posed as a vector-valued regression problem with the training data now considered as $\{(x_i, L(y_i, \cdot))\}_{i=1}^n$ (and we identify $\mathcal{H}_L$ with the general Hilbert space $\mathcal{V}$ of Section 2.2). From Theorem 2.2, the solution is

$$\hat\mu_{\lambda,n} = \sum_{i=1}^n \Gamma_{x_i} c_i, \qquad (11)$$

where the coefficients $\{c_i\}_{i \le n}$, $c_i \in \mathcal{H}_L$ are the unique solution of the linear equations

$$\sum_{i \le n} (\Gamma(x_j, x_i) + \lambda \delta_{ji}) c_i = L(y_j, \cdot), \quad 1 \le j \le n.$$



Input/Output space
  (i) $\mathcal{X}$ is Polish.
  (ii) $\mathcal{V}$ is separable. (f.b.a.)
  (iii) $\exists C > 0$ such that $\forall x \in \mathcal{X}$, $\mathrm{Tr}(\Gamma(x, x)) \le C$ holds.

Space of regressors
  (iv) $\mathcal{H}_\Gamma$ is separable.
  (v) All $\Gamma_x^*$ are HS. (f.b.a.)
  (vi) $B : (x, y) \to \langle f, \Gamma(y, x) g \rangle_{\mathcal{V}}$ is measurable $\forall f, g \in \mathcal{V}$.

True distribution
  (vii) $L(y, y) < \infty$ for all $y \in \mathcal{Y}$.
  (viii) $\mathcal{E}_s[\mu^*] = \inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu]$.

Table 1. Assumptions for Corollary 4.1 and 4.2. f.b.a. stands for fulfilled by assumption that $\mathcal{V}$ is finite dimensional.



It remains to choose the kernel $\Gamma$. Given a real-valued kernel $K$ on $\mathcal{X}$, a natural choice for the RKHS $\mathcal{H}_\Gamma$ would be the space of functions from $\mathcal{X}$ to $\mathcal{H}_L$ whose elements are defined as functions via $(h, K(x, \cdot))(x') := K(x, x') h$, which is isomorphic to $\mathcal{H}_L \otimes \mathcal{H}_K$, with inner product

$$\langle g K(x, \cdot), h K(x', \cdot) \rangle_\Gamma := \langle g, h \rangle_L K(x, x') \qquad (12)$$

for all $g, h \in \mathcal{H}_L$. It is easy to check that this satisfies the conditions to be a vector-valued RKHS; in fact it corresponds to the choice $\Gamma(x, x') = K(x, x')\,\mathrm{Id}$, where $\mathrm{Id} : \mathcal{H}_L \to \mathcal{H}_L$ is the identity map on $\mathcal{H}_L$. The solution to (10) with this choice is then given by (11), with

$$\sum_{i \le n} (K(x_j, x_i) + \lambda \delta_{ji}) c_i = L(y_j, \cdot), \quad 1 \le j \le n,$$
$$c_i = \sum_{j \le n} W_{ij} L(y_j, \cdot), \quad 1 \le i \le n,$$

where $W = (\mathbf{K} + \lambda I)^{-1}$, which corresponds exactly to the embeddings (1) presented in (Song et al., 2009; 2010) (after a rescaling of $\lambda$). Thus we have shown that the embeddings of Song et al. are the solution to a regression problem for a particular choice of operator-valued kernel. Further, the loss defined by (7) is a key error functional in this context since it is the objective which the estimated embeddings attempt to minimise. In Sec. 3.3 we will see that this does not always coincide with (5), which may be a more natural choice. In Sec. 4 we analyze the performance of the embeddings defined by (10) at minimizing the objective (7).

3.2. Some consequences of this equivalence 

We derive some immediate benefits from the connec- 
tion described above. Since the embedding problem has 
been identified as a regression problem with training set 
$\mathcal{D} := \{(x_i, L(y_i, \cdot))\}_{i=1}^n$, we can define a cross validation scheme for parameter selection in the usual way: by holding out a subsample $\mathcal{D}_{val} = \{(x_{t_j}, L(y_{t_j}, \cdot))\}_{j=1}^T \subset \mathcal{D}$, we can train embeddings $\hat\mu$ on $\mathcal{D} \setminus \mathcal{D}_{val}$ over a grid of kernel or regularization parameters, choosing the parameters achieving the best error $\sum_{j=1}^T \|\hat\mu(x_{t_j}) - L(y_{t_j}, \cdot)\|_L^2$ on the validation set (or over many folds of cross validation). Another key benefit will be a much improved performance analysis of the embeddings, presented in Section 4.
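
The hold-out scheme above can be sketched as follows. The validation error is computed in closed form by expanding the squared norm with the reproducing property: $\|\hat\mu(x) - L(y, \cdot)\|_L^2 = L(y, y) - 2 \sum_i \alpha_i(x) L(y_i, y) + \sum_{i,j} \alpha_i(x) \alpha_j(x) L(y_i, y_j)$. The Gaussian kernels, the single random split, and the names `rbf`, `validation_error` and `select_lambda` are assumptions for illustration; the weights use $W = (\mathbf{K} + \lambda I)^{-1}$ as in Section 3.1.

```python
import numpy as np


def rbf(A, B, bandwidth):
    """Gaussian kernel matrix between rows of A and rows of B (assumed kernel choice)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth**2))


def validation_error(X_tr, Y_tr, X_val, Y_val, lam, bw_x, bw_y):
    """Sum over validation pairs of ||mu_hat(x) - L(y, .)||_L^2."""
    n = X_tr.shape[0]
    K = rbf(X_tr, X_tr, bw_x)
    alpha = np.linalg.solve(K + lam * np.eye(n), rbf(X_tr, X_val, bw_x))  # (n, T)
    L_tr = rbf(Y_tr, Y_tr, bw_y)
    L_cross = rbf(Y_tr, Y_val, bw_y)                  # entries L(y_i, y_val_t)
    L_val_diag = np.ones(X_val.shape[0])              # L(y, y) = 1 for a Gaussian kernel
    quad = np.einsum("it,ij,jt->t", alpha, L_tr, alpha)
    cross = np.sum(alpha * L_cross, axis=0)
    return float(np.sum(L_val_diag - 2.0 * cross + quad))


def select_lambda(X, Y, lambdas, val_fraction=0.25, bw_x=1.0, bw_y=1.0, seed=0):
    """Pick the regularizer with the smallest hold-out error on one random split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_val = max(1, int(val_fraction * X.shape[0]))
    val, tr = perm[:n_val], perm[n_val:]
    errors = [validation_error(X[tr], Y[tr], X[val], Y[val], lam, bw_x, bw_y)
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))], errors
```

The same `validation_error` function could be reused inside a loop over several folds or over a grid of bandwidths; a single split is used here only to keep the sketch short.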

3.3. Relations between the error functionals $\mathcal{E}$ and $\mathcal{E}_s$

In Section 3.1 we introduced an alternative risk function $\mathcal{E}_s$ for $\mathcal{E}$, which we used to derive an estimation scheme to recover conditional mean embeddings. We now examine the relationship between the two risk functionals. When the true conditional expectation on functions $h \in \mathcal{H}_L$ can be represented through an element $\mu^* \in \mathcal{H}_\Gamma$ then $\mu^*$ minimises both objectives.

Theorem 3.1 (Proof in App. A). If there exists a $\mu^* \in \mathcal{H}_\Gamma$ such that for any $h \in \mathcal{H}_L$: $E[h|X] = \langle h, \mu^*(X) \rangle_L$ $P_X$-a.s., then $\mu^*$ is the $P_X$-a.s. unique minimiser of both objectives:

$$\operatorname*{argmin}_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}[\mu] = \operatorname*{argmin}_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu] \quad P_X\text{-a.s.}$$

Thus in this case, the embeddings of Song et al. (e.g. 2009) minimise both (5) and (7). More generally, however, this may not be the case. Let us define an element $\hat\mu$ that is $\delta$ close w.r.t. the error $\mathcal{E}_s$ to the minimizer $\mu'$ of $\mathcal{E}_s$ in $\mathcal{H}_\Gamma$ (this might for instance be the minimizer of the empirical regularized loss for sufficiently many samples). We are interested in finding conditions under which $\mathcal{E}(\hat\mu)$ is not much worse than a good approximation $\mu^*$ in $\mathcal{H}_\Gamma$ to the conditional expectation. The sense in which $\mu^*$ approximates the conditional expectation is somewhat subtle: $\mu^*$ must closely approximate the conditional expectation of functions $\mu \in \mathcal{H}_\Gamma$ under the original loss $\mathcal{E}$ (note that the loss $\mathcal{E}$ was originally defined in terms of functions $h \in \mathcal{H}_L$).

Theorem 3.2 (Proof in App. A). Let $\mu'$ be a minimiser of $\mathcal{E}_s$ and $\hat\mu$ be an element of $\mathcal{H}_\Gamma$ with $\mathcal{E}_s[\hat\mu] \le \mathcal{E}_s[\mu'] + \delta$. Define $\mathcal{A} := \{(\eta, \mu^*) \mid \eta^2 = \sup_{\|\mu\|_\Gamma \le 1} E_X[E_Y[\mu(X)|X] - \langle \mu(X), \mu^*(X) \rangle_L]^2\}$, then

$$\mathcal{E}[\hat\mu] \le \inf_{(\eta, \mu^*) \in \mathcal{A}} \left( \sqrt{\mathcal{E}[\mu^*]} + \sqrt{8\eta\left(\|\mu^*\|_\Gamma + \|\hat\mu\|_\Gamma\right)} + \delta^{1/2} \right)^2.$$

Apart from the more obvious condition that $\delta$ be small, the above theorem suggests that $\|\hat\mu\|_\Gamma$ should also be made small for the solution $\hat\mu$ to have low error $\mathcal{E}$. In other words, even in the infinite sample case, the regularization of $\hat\mu$ in $\mathcal{H}_\Gamma$ is important.

4. Better convergence rates for embeddings

The interpretation of the mean embedding as a vector-valued regression problem allows us to apply regression minimax theorems to study convergence rates of the embedding estimator. These rates are considerably better than the current state of the art for the embeddings, and hold under milder and more intuitive assumptions.

We start by comparing the statements which we derive from (Caponnetto & De Vito, 2007, Thms. 1 and 2) with the known convergence results for the embedding estimator. We follow this up with a discussion of the rates and a comparison of the assumptions.

4.1. Convergence theorems

We address the performance of the embeddings defined by (10) in terms of asymptotic guarantees on the loss $\mathcal{E}_s$ defined by (7). Caponnetto & De Vito (2007) study uniform convergence rates for regression. Convergence rates of learning algorithms cannot be uniform on the set of all probability distributions if the output vector space is an infinite dimensional RKHS (Caponnetto & De Vito, 2007)[p. 4]. It is therefore necessary to restrict ourselves to a subset of probability measures. This is done by Caponnetto & De Vito (2007) by defining families of probability measures $\mathcal{P}(b, c)$ indexed by two parameters $b \in\, ]1, \infty]$ and $c \in [1, 2]$. We discuss the family $\mathcal{P}(b, c)$ in detail below. The important point at the moment is that $b$ and $c$ affect the optimal schedule for the regulariser $\lambda$ and the convergence rate. The rate of convergence is better for higher $b$ and $c$ values. Caponnetto & De Vito (2007) provide convergence rates for all choices of $b$ and $c$. We restrict ourselves to the best case $b = \infty$, $c > 1$ and the worst case¹ $b = 2$, $c = 1$.

¹Strictly speaking the worst case is $b \downarrow 1$ (see supp.).

We recall that the estimated conditional mean embeddings are given by (10), where $\lambda$ is a chosen regularization parameter. We assume $\lambda_n$ is chosen to follow a specific schedule, dependent upon $n$: we denote by $\hat\mu_n$ the embeddings following this schedule and $\mu' := \operatorname{argmin}_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu]$. Given this optimal rate of decrease for $\lambda_n$, Thm. 1 of Caponnetto & De Vito (2007) yields the following convergence statements for the estimated embeddings, under assumptions to be discussed in Section 4.2.

Corollary 4.1. Let $b = \infty$, $c > 1$ then for every $\epsilon > 0$ there exists a constant $\tau$ such that

$$\limsup_{n \to \infty} \sup_{P \in \mathcal{P}(b,c)} P^n\left[ \mathcal{E}_s[\hat\mu_n] - \mathcal{E}_s[\mu'] > \tau \frac{1}{n} \right] < \epsilon.$$

Let $b = 2$ and $c = 1$ then for every $\epsilon > 0$ there exists a constant $\tau$ such that

$$\limsup_{n \to \infty} \sup_{P \in \mathcal{P}(b,c)} P^n\left[ \mathcal{E}_s[\hat\mu_n] - \mathcal{E}_s[\mu'] > \tau \left(\frac{\log n}{n}\right)^{2/3} \right] < \epsilon.$$

The rate for the estimate $\hat\mu_n$ can be complemented with minimax lower rates for vector-valued regression (Caponnetto & De Vito, 2007)[Th. 2] in the case that $b < \infty$.

Corollary 4.2. Let $b = 2$ and $c = 1$ and let $\Lambda_n := \{l_n \mid l_n : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}_\Gamma\}$ be the set of all learning algorithms working on $n$ samples, outputting $\nu_n \in \mathcal{H}_\Gamma$. Then for every $\epsilon > 0$ there exists a constant $\tau > 0$ such that

$$\liminf_{n \to \infty} \inf_{l_n \in \Lambda_n} \sup_{P \in \mathcal{P}(b,c)} P^n\left[ \mathcal{E}_s[\nu_n] - \mathcal{E}_s[\mu'] > \tau \left(\frac{1}{n}\right)^{2/3} \right] > 1 - \epsilon.$$

This corollary tells us that there exists no learning algorithm which can achieve better rates than $n^{-2/3}$ uniformly over $\mathcal{P}(2, 1)$, and hence the estimate $\hat\mu_n$ is optimal up to a logarithmic factor.

State of the art results for the embedding. The current convergence result for the embedding is proved by Song et al. (2010, Th. 1). A crucial assumption that we discuss in detail below is that the mapping $x \mapsto E[h(Y)|X = x]$ is in the RKHS $\mathcal{H}_K$ of the real-valued kernel, i.e. that for all $h \in \mathcal{H}_L$ we have that there exists a $f_h \in \mathcal{H}_K$, such that

$$E[h(Y)|X = x] = f_h(x). \qquad (13)$$

The result of Song et al. implies the following (see App. C): if $K(x, x) \le B$ for all $x \in \mathcal{X}$ and the schedule $\lambda(n) = n^{-1/4}$ is used: for a fixed probability measure $P$, there exists a constant $\tau$ such that

$$\lim_{n \to \infty} P^n\left[ \mathcal{E}[\hat\mu_n] > \tau \left(\frac{1}{n}\right)^{1/4} \right] = 0, \qquad (14)$$

where $\hat\mu_n(x)$ is the estimate from Song et al. No complementary lower bounds were known until now.

Comparison. The first thing to note is that under the assumption that $E[h|X]$ is in the RKHS $\mathcal{H}_K$ the minimisers of $\mathcal{E}_s$ and $\mathcal{E}$ are a.e. equivalent due to Theorem 3.1: the assumption implies a $\mu^* \in \mathcal{H}_\Gamma$ exists with $E[h|X] = \langle h, \mu^*(X) \rangle_L$ for all $h \in \mathcal{V}$ (see App. B.4 for details). Hence, under this assumption, the statements from eq. 14 and Cor. 4.1 ensure we converge to the true conditional expectation, and achieve an error of 0 in the risk $\mathcal{E}$.

In the case that this assumption is not fulfilled and eq. 14 is not applicable, Cor. 4.1 still tells us that we converge to the minimiser of $\mathcal{E}_s$. Coupling this statement with Thm. 3.2 allows us to bound the distance to the minimal error $\mathcal{E}[\mu^*]$, where $\mu^* \in \mathcal{H}_\Gamma$ minimises $\mathcal{E}$.

The other main differences are obviously the rates, and that Cor. 4.1 bounds the error uniformly over a space of probability measures, while eq. 14 provides only a point-wise statement (i.e., for a fixed probability measure $P$).

4.2. Assumptions 

Cor. 4.1 and 4.2. Our main assumption is that $\mathcal{H}_L$ is finite dimensional. It is likely that this assumption can be weakened, but this requires a deeper analysis.

The assumptions of Caponnetto & De Vito (2007) are summarized in Table 1, where we provide details in App. B.2. App. B.1 contains simple and complete assumptions that ensure all statements in the table hold. Beside some measure theoretic issues, the assumptions are fulfilled if for example, 1) $\mathcal{X}$ is a compact subset of $\mathbb{R}^n$, $\mathcal{Y}$ is compact, $\mathcal{H}_L$ is a finite dimensional RKHS, $\Gamma$ and $L$ are continuous; 2) $\mathcal{E}_s[\mu'] = \inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu]$. This last condition is unintuitive, but can be rewritten in the following way:

Theorem 4.3 (Proof in App.). Let $\|h\|_{\mathcal{V}}$, $\|\mu(x) - h\|_{\mathcal{V}}$ be integrable for all $h \in \mathcal{H}_\Gamma$ and let $\mathcal{V}$ be finite dimensional. Then there exists a $\mu' \in \mathcal{H}_\Gamma$ with $\mathcal{E}_s[\mu'] = \inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu]$ iff a $B > 0$ exists and a sequence $\{\mu_n\}_{n \in \mathbb{N}}$ with $\mathcal{E}_s[\mu_n] \le \inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu] + 1/n$ and $\|\mu_n\|_\Gamma < B$.

The intuition is that the condition is not fulfilled if we need to make $\mu_n$ more and more complex (in the sense of a high RKHS norm) to optimize the risk.



Definition and discussion of $\mathcal{P}(b, c)$. The family of probability measures $\mathcal{P}(b, c)$ is characterized through spectral properties of the kernel function $\Gamma$. The assumptions correspond to assumptions on the eigenvalues in Mercer's theorem in the real-valued case, i.e. that there are finitely many eigenvalues or that the eigenvalues decrease with a certain rate. In detail, define the operator $A$ through $A(\phi)(x') := \int_{\mathcal{X}} \Gamma(x', x) \phi(x)\, dP_X$, where $\phi \in L^2(P_X)$. $A$ can be written as (Caponnetto & De Vito, 2007, Rem. 2) $A = \sum_{n=1}^N \lambda_n \langle \cdot, \phi_n \rangle_P\, \phi_n$, where the inner product is the $L^2$ inner product with measure $P_X$ and $N = \infty$ is allowed. As in the real-valued case, the eigendecomposition depends on the measure on the space $\mathcal{X}$ but is independent of the distribution on $\mathcal{Y}$. The eigendecomposition measures the complexity of the kernel, where the lowest complexity is achieved for finite $N$ (that is, the case $b = \infty$, $c > 1$) and the highest complexity is reached if the eigenvalues decrease with the slowest possible rate, $\lambda_n < C/n$ for a constant $C$. The case $b = 2$, $c = 1$ corresponds to a slightly faster decay, namely, $\lambda_n < C/n^2$. In essence, there are no assumptions on the distribution on $\mathcal{Y}$, but only on the complexity of the kernel $\Gamma$ as measured with $P_X$.



Embedding. The results of Song et al. (2010) do not rely on the assumption that $\mathcal{V}$ is finite dimensional. Other conditions on the distribution are required, however, which are challenging to verify. To describe these conditions, we recall the real-valued RKHS $\mathcal{H}_K$ with kernel $K$, and define the uncentred cross-covariance operator $C_{YX} : \mathcal{H}_K \to \mathcal{H}_L$ such that $\langle g, C_{YX} f \rangle_{\mathcal{H}_L} = E_{XY}(f(X) g(Y))$, with the covariance operator $C_{XX}$ defined by analogy. One of the two main assumptions of Song et al. is that $C_{YX} C_{XX}^{-3/2}$ needs to be Hilbert-Schmidt. The covariances $C_{YX}$ and $C_{XX}$ are compact operators, meaning $C_{XX}$ is not invertible when $\mathcal{H}_K$ is infinite dimensional (this gives rise to a notational issue, although the "product" operator $C_{YX} C_{XX}^{-3/2}$ may still be defined). Whether $C_{YX} C_{XX}^{-3/2}$ is Hilbert-Schmidt (or even bounded) will depend on the underlying distribution $P_{XY}$ and on the kernels $K$ and $L$. At this point, however, there is no easy way to translate properties of $P_{XY}$ to guarantees that the assumption holds.

The second main assumption is that the conditional expectation can be represented as an RKHS element (see App. B.4). Even for rich RKHSs (such as universal RKHSs), it can be challenging to determine the associated conditions on the distribution $P_{XY}$. For simple finite dimensional RKHSs, the assumption may fail, as shown below.

Corollary 4.4 (Proof in App. C). Let $\mathcal{V}$ be finite dimensional such that a function $h \in \mathcal{V}$ exists with $h(y) \ge \epsilon > 0$ for all $y \in \mathcal{Y}$. Furthermore, let $\mathcal{X} := [-1, 1]$ and the reproducing kernel for $\mathcal{H}_K$ be $K(x, y) = xy$. Then there exists no measure for which the assumption from eq. (13) can be fulfilled.

5. Sparse embeddings 

In many practical situations it is desirable to approximate 
the conditional mean embedding by a sparse version which 
involves a smaller number of parameters. For example, 
in the context of reinforcement learning and planning, the 
sample size n is large and we want to use the embeddings 
over and over again, possibly on many different tasks and 
over a long time frame. 

Here we present a technique to achieve a sparse approximation of the sample mean embedding. Recall that this is given by the formula (cf. equation (1))

$$\hat\mu(x) = \sum_{i,j=1}^n W_{ij} K(x_i, x) L(y_j, \cdot),$$

where $W = (\mathbf{K} + n\lambda I)^{-1}$. A natural approach to find a sparse approximation of $\hat\mu$ is to look for a function $\mu$ which is close to $\hat\mu$ according to the RKHS norm $\|\cdot\|_\Gamma$ (in App. D we establish a link between this objective and our cost function $\mathcal{E}$). In the special case that $\Gamma = K\,\mathrm{Id}$ this amounts to solving the optimization problem

$$\min_{M \in \mathbb{R}^{n \times n}} \|f(M - W)\|_\Gamma^2 + \gamma \|M\|_{1,1} \qquad (15)$$

where $\gamma$ is a positive parameter, $\|M\|_{1,1} := \sum_{i,j} |M_{ij}|$ and

$$f(M) = \sum_{i,j=1}^n M_{ij} K(x_i, \cdot) L(y_j, \cdot). \qquad (16)$$

Problem (15) is equivalent to a kind of Lasso problem with $n^2$ variables: when $\gamma = 0$, $M = W$ at the optimum and the approximation error is zero; as $\gamma$ increases, the approximation error increases as well, but the solution obtained becomes sparse (many of the elements of matrix $M$ are equal to zero).

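A direct computation yields that the above optimization problem is equivalent to

$$\min_{M \in \mathbb{R}^{n \times n}} \operatorname{tr}\left((M - W)^\top \mathbf{K} (M - W) \mathbf{L}\right) + \gamma \sum_{i,j=1}^n |M_{ij}|. \qquad (17)$$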
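
The direct computation can be spelled out as follows; this is a sketch that assumes the data-fit term is the squared RKHS norm of $f(M - W)$, writes $D := M - W$, $\mathbf{K}_{ik} := K(x_i, x_k)$ and $\mathbf{L}_{jl} := L(y_j, y_l)$, and uses the inner product (12) together with the symmetry of $\mathbf{L}$.

```latex
\begin{aligned}
\|f(D)\|_\Gamma^2
  &= \Big\langle \sum_{i,j} D_{ij} K(x_i,\cdot) L(y_j,\cdot),\;
                 \sum_{k,l} D_{kl} K(x_k,\cdot) L(y_l,\cdot) \Big\rangle_\Gamma \\
  &= \sum_{i,j,k,l} D_{ij} D_{kl}\, K(x_i, x_k)\, L(y_j, y_l) \\
  &= \sum_{j} \sum_{i,k,l} (D^\top)_{ji}\, \mathbf{K}_{ik}\, D_{kl}\, \mathbf{L}_{lj}
   = \operatorname{tr}\!\left(D^\top \mathbf{K} D\, \mathbf{L}\right).
\end{aligned}
```

The second line applies (12) term by term, and the last line only regroups the four-index sum as a matrix product.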

In the experiments in Section 5.1 we solve problem (17) with FISTA (Beck & Teboulle, 2009), an optimal first order method which requires $O(1/\sqrt{\epsilon})$ iterations to reach an $\epsilon$ accuracy of the minimum value in (17), with a cost per iteration of $O(n^2)$ in our case. The algorithm is outlined below, where $S_\gamma(Z_{ij}) = \operatorname{sign}(Z_{ij})(|Z_{ij}| - \gamma)_+$ and $(z)_+ = z$ if $z > 0$ and zero otherwise.

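Algorithm 1 LASSO-like Algorithm
input: $W$, $\gamma$, $\mathbf{K}$, $\mathbf{L}$   output: $M$
$Z_1 = Q_1 = 0$, $\theta_1 = 1$, $C = \|\mathbf{K}\|\, \|\mathbf{L}\|$
for $t = 1, 2, \ldots$ do
    $Z_{t+1} = S_{\gamma/C}\left(Q_t - \frac{1}{C}(\mathbf{K} Q_t \mathbf{L} - \mathbf{K} W \mathbf{L})\right)$
    $\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$
    $Q_{t+1} = Z_{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} (Z_{t+1} - Z_t)$
end for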
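
A minimal NumPy sketch of a FISTA iteration for objective (17) is given below. It is not a transcription of Algorithm 1: the step size $1/(2C)$ with $C = \|\mathbf{K}\|\,\|\mathbf{L}\|$ (spectral norms) and the matching threshold $\gamma/(2C)$ are assumptions derived from the gradient $2\mathbf{K}(Q - W)\mathbf{L}$ of the trace term, and may differ from the listing above by a rescaling of $\gamma$. The function names and the fixed iteration count are likewise illustrative.

```python
import numpy as np


def soft_threshold(Z, tau):
    """Entrywise S_tau(Z) = sign(Z) * max(|Z| - tau, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)


def objective(M, W, K, L, gamma):
    """Objective (17): tr((M - W)^T K (M - W) L) + gamma * sum_ij |M_ij|."""
    D = M - W
    return np.trace(D.T @ K @ D @ L) + gamma * np.abs(M).sum()


def sparse_embedding_fista(W, K, L, gamma, n_iter=500):
    """FISTA for (17), assuming K and L are symmetric positive semidefinite."""
    C = np.linalg.norm(K, 2) * np.linalg.norm(L, 2)
    step = 1.0 / (2.0 * C)                 # the gradient 2 K (Q - W) L is 2C-Lipschitz
    Z = np.zeros_like(W)
    Q = np.zeros_like(W)
    theta = 1.0
    for _ in range(n_iter):
        grad = 2.0 * K @ (Q - W) @ L
        Z_next = soft_threshold(Q - step * grad, gamma * step)
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta**2)) / 2.0
        # Momentum step: extrapolate from the two most recent iterates.
        Q = Z_next + ((theta - 1.0) / theta_next) * (Z_next - Z)
        Z, theta = Z_next, theta_next
    return Z
```

With $\gamma = 0$ the iterates approach the dense solution $W$, while a sufficiently large $\gamma$ keeps every entry at zero; intermediate values trade approximation error for sparsity, which `objective` can be used to monitor.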

Figure 1. Comparison between our sparse algorithm and an incomplete Cholesky decomposition. The x-axis shows the level of sparsity, where on the right side the original solution is recovered. The y-axis shows the distance to the dense solution and the test error relative to the dense solution.

Other sparsity methods could also be employed. For example, we may replace the norm $\|\cdot\|_{1,1}$ by a block $\ell_1/\ell_2$ norm. That is, we may choose the norm $\|M\|_{2,1} := \sum_{i=1}^n \sqrt{\sum_{j=1}^n M_{ij}^2}$, which is the sum of the $\ell_2$ norms of the rows of $M$. This penalty function encourages sparse approximations which use few input points but all the outputs. Similarly, the penalty $\|M^\top\|_{2,1}$ will sparsify over the outputs. Finally, if we wish to remove many pairs of examples we may use the more sophisticated penalty


5.1. Experiment 



We demonstrate here that the link between vector-valued 
regression and the mean embeddings can be leveraged to 
develop useful embedding alternatives that exploit proper- 
ties of the regression formulation: we apply the sparse al- 
gorithm to a challenging reinforcement learning task. The 
sparse algorithm makes use of the labels, while other al- 
gorithms to sparsify the embeddings without our regres- 
sion interpretation cannot make use of these. In particular, 
a popular method to sparsify is the incomplete Cholesky 
decomposition (Shawe-Taylor & Cristianini, 2004), which
sparsifies based on the distribution on the input space X 
only. We compare to this method in the experiments. 

The reinforcement learning task is the under-actuated pendulum 
swing-up problem from Deisenroth et al. (2009). We generate a 
discrete-time approximation of the continuous-time pendulum 
dynamics as done in (Deisenroth et al., 2009). Starting from an 
arbitrary state the goal is to swing the pendulum up and balance 
it in the inverted position. The applied torque is 
$u \in [-5, 5]$ Nm and is not sufficient for a direct swing up. 
The state space is defined by the angle $\theta \in [-\pi, \pi]$ 
and the angular velocity $\omega \in [-7, 7]$. The reward is given 
by the function $r(\theta, \omega) = \exp(-\theta^2 - 0.2\omega^2)$. 
The learning algorithm is a kernel method which uses the mean 
embedding estimator to perform policy iteration (Grünewälder et 
al., 2012). Sparse solutions are very useful in this task, as the 
policy iteration applies the mean embedding many times to perform 
updates. The input space has 4 dimensions (sine and cosine of the 
angle, angular velocity and applied torque), while the output has 
3 (sine and cosine of the angle and angular velocity). We sample 
uniformly a training set of 200 examples and learn the mean 
conditional embedding using the direct method (1). We then compare 
the sparse approximation obtained by Algorithm 1 using different 
values of the parameter $\gamma$ to the approximation obtained via 
an incomplete Cholesky decomposition at different levels of 
sparsity. We assess the approximations using the test error and 
(16), which is an upper bound on the generalization error (see 
App. D), and report the results in Figure 1. 
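
The state encoding and reward described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the reward is taken to be exp(-theta^2 - 0.2 omega^2), and the pendulum dynamics (which would produce the next states) are not implemented.

```python
import numpy as np

def reward(theta, omega):
    """Reward, assumed here to be exp(-theta^2 - 0.2 * omega^2)."""
    return np.exp(-theta ** 2 - 0.2 * omega ** 2)

def encode_input(theta, omega, u):
    """4-dimensional input: sine and cosine of the angle, angular velocity, torque."""
    return np.stack([np.sin(theta), np.cos(theta), omega, u], axis=-1)

def encode_output(theta, omega):
    """3-dimensional output: sine and cosine of the angle, angular velocity."""
    return np.stack([np.sin(theta), np.cos(theta), omega], axis=-1)

rng = np.random.default_rng(0)
n = 200                                   # training set size stated in the text
theta = rng.uniform(-np.pi, np.pi, n)     # angle
omega = rng.uniform(-7.0, 7.0, n)         # angular velocity
u = rng.uniform(-5.0, 5.0, n)             # applied torque in Nm

inputs = encode_input(theta, omega, u)    # shape (200, 4)
# Applying encode_output to the successor states would give the regression
# targets; producing those states requires the dynamics, omitted here.
rewards = reward(theta, omega)
```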

6. Outlook 

We have established a link between vector-valued regres- 
sion and conditional mean embeddings. On the basis of 
this link, we derived a sparse embedding algorithm, showed 
how cross-validation can be performed, established better 
convergence rates under milder assumptions, and comple- 
mented these upper rates with lower rates, showing that the 
embedding estimator achieves near optimal rates. 

There are a number of interesting questions and problems 
which follow from our framework. It may be valuable to 
employ other kernels $\Gamma$ in place of the kernel $K(x, y)\,Id$ 
that leads to the mean embedding, so as to exploit knowledge 
about the data generating process. As a related observation, 
for the kernel $\Gamma(x, y) := K(x, y)\,Id$, $\Gamma_x$ is not a 
Hilbert-Schmidt operator if $\mathcal{V}$ is infinite dimensional, as 

$$\|\Gamma_x\|_{HS}^2 = K(x, x) \sum_{i=1}^{\infty} \langle e_i, Id\, e_i \rangle_{\mathcal{V}} = \infty;$$

however, the convergence results from (Caponnetto & De 
Vito, 2007) assume $\Gamma_x$ to be Hilbert-Schmidt. While this 
might simply be a result of the technique used in (Caponnetto 
& De Vito, 2007), it might also indicate a deeper 
problem with the standard embedding estimator, namely 
that if $\mathcal{V}$ is infinite dimensional then the rates degrade. The 
latter case would have a natural interpretation as an over-fitting 
effect, as $Id$ does not "smooth" the element $h \in \mathcal{V}$. 
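
The divergence can be made concrete with a small numerical sketch. It truncates $\mathcal{V}$ to a finite dimension and evaluates the sum $\sum_i \langle e_i, \Gamma(x,x) e_i \rangle$ for $\Gamma(x,x) = K(x,x)\,Id$; the function name and the value of $K(x,x)$ are illustrative assumptions.

```python
import numpy as np

def hs_norm_sq_point_evaluator(k_xx, dim_v):
    """Sum_i <e_i, Gamma(x,x) e_i> for Gamma(x,x) = K(x,x) * Id on a dim_v-dimensional space."""
    gamma_xx = k_xx * np.eye(dim_v)
    basis = np.eye(dim_v)                      # rows form an orthonormal basis e_1, ..., e_d
    return sum(float(e @ gamma_xx @ e) for e in basis)

k_xx = 0.5                                     # hypothetical value of K(x, x)
dims = [1, 10, 100, 1000]
values = [hs_norm_sq_point_evaluator(k_xx, d) for d in dims]
# The squared Hilbert-Schmidt norm equals K(x,x) * dim(V), so it grows
# without bound as the truncation dimension increases.
```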

Our sparsity approach can potentially be equipped with 
other regularisers that cut down on rows and columns of 
the W matrix in parallel. Certainly, ours is not the only 
sparse regression approach, and other sparse regularizers 
might yield good performance on appropriate problems. 
SUPPLEMENTARY 



A. Similarity of minimisers 

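We assume that all $h \in \mathcal{V}$ are integrable with respect to the regular conditional probability $P(\cdot|x)$ and all $\mu \in \mathcal{H}_\Gamma$ are 
integrable with respect to $P$. In particular, these are fulfilled if our general assumptions from Section B.1 are fulfilled. 

Lemma A.1. If there exists $\mu^* \in \mathcal{H}_\Gamma$ such that for any $h \in \mathcal{V}$: $\mathbb{E}[h|X] = \langle h, \mu^*(X) \rangle_{\mathcal{V}}$ $P_X$-a.s., then for any $\mu \in \mathcal{H}_\Gamma$: 

(i) $\mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu^*(X) \rangle_{\mathcal{V}} = \mathbb{E}_X \|\mu^*(X)\|_{\mathcal{V}}^2$, 

(ii) $\mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu(X) \rangle_{\mathcal{V}} = \mathbb{E}_X \langle \mu^*(X), \mu(X) \rangle_{\mathcal{V}}$. 

Proof. (i) follows from (ii) by setting $\mu := \mu^*$. (ii) can be derived as follows: 

$$\mathbb{E}_X \langle \mu(X), \mu^*(X) \rangle_{\mathcal{V}} = \mathbb{E}_X \mathbb{E}_Y[\mu(X)(Y)|X] = \mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu(X) \rangle_{\mathcal{V}},$$

where we used Fubini's theorem for the last equality (Kallenberg, 2001, Thm. 1.27). □ 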



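Theorem A.2 (Thm. 3.1). If there exists a $\mu^* \in \mathcal{H}_\Gamma$ such that for any $h \in \mathcal{V}$: $\mathbb{E}[h|X] = \langle h, \mu^*(X) \rangle_{\mathcal{V}}$ $P_X$-a.s., then 

$$\mu^* = \operatorname*{argmin}_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}[\mu] = \operatorname*{argmin}_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}_s[\mu] \quad P_X\text{-a.s.}$$

Furthermore, the minimiser is $P_X$-a.s. unique. 

Proof. We start by showing that the right side is minimised by $\mu^*$. Let $\mu$ be any element in $\mathcal{H}_\Gamma$; then we have 

$$\mathbb{E}_{X,Y}[\|L(Y, \cdot) - \mu(X)\|_{\mathcal{V}}^2] - \mathbb{E}_{X,Y}[\|L(Y, \cdot) - \mu^*(X)\|_{\mathcal{V}}^2]$$
$$= \mathbb{E}_X \|\mu(X)\|_{\mathcal{V}}^2 - 2\mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu(X) \rangle_{\mathcal{V}} + 2\mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu^*(X) \rangle_{\mathcal{V}} - \mathbb{E}_X \|\mu^*(X)\|_{\mathcal{V}}^2$$
$$= \mathbb{E}_X \|\mu(X)\|_{\mathcal{V}}^2 - 2\mathbb{E}_X \langle \mu^*(X), \mu(X) \rangle_{\mathcal{V}} + \mathbb{E}_X \|\mu^*(X)\|_{\mathcal{V}}^2 = \mathbb{E}_X[\|\mu(X) - \mu^*(X)\|_{\mathcal{V}}^2] \ge 0.$$

Hence, $\mu^*$ is a minimiser. The minimiser is furthermore $P_X$-a.s. unique: assume there is a second minimiser $\mu'$; then the above 
calculation shows that $0 = \mathbb{E}_{X,Y}[\|L(Y, \cdot) - \mu'(X)\|_{\mathcal{V}}^2] - \mathbb{E}_{X,Y}[\|L(Y, \cdot) - \mu^*(X)\|_{\mathcal{V}}^2] = \mathbb{E}_X[\|\mu^*(X) - \mu'(X)\|_{\mathcal{V}}^2]$. 
Hence, $\|\mu^*(X) - \mu'(X)\|_{\mathcal{V}} = 0$ $P_X$-a.s. (Fremlin, 2000)[122Rc], i.e. a measurable set $M$ 
with $P_X M = 1$ exists such that $\|\mu^*(X) - \mu'(X)\|_{\mathcal{V}} = 0$ holds for all $X \in M$. As $\|\cdot\|_{\mathcal{V}}$ is a norm we have that 
$\mu^*(X) = \mu'(X)$ $P_X$-a.s. 

To show the second equality we note that, for every $h \in \mathcal{V}$, $\mathbb{E}_X(\mathbb{E}[h|X] - \langle h, \mu^*(X) \rangle_{\mathcal{V}})^2 = 0$ by assumption. Hence, 
the supremum over $B(\mathcal{V})$ is also $0$ and the minimum is attained at $\mu^*(X)$. Uniqueness can be seen in the following way: 
assume there is a second minimiser $\mu'$; then for all $h \in \mathcal{V}$ we have $\mathbb{E}_X(\langle h, \mu'(X) - \mu^*(X) \rangle_{\mathcal{V}})^2 \le \mathbb{E}_X(\langle h, \mu'(X) \rangle_{\mathcal{V}} - \mathbb{E}[h|X])^2 + \mathbb{E}_X(\mathbb{E}[h|X] - \langle h, \mu^*(X) \rangle_{\mathcal{V}})^2 = 0$. Hence, $\langle h, \mu'(X) - \mu^*(X) \rangle_{\mathcal{V}} = 0$ $P_X$-a.s. (Fremlin, 2000)[122Rc], i.e. 
a measurable set $M$ with $P_X M = 1$ exists such that $\langle h, \mu'(X) - \mu^*(X) \rangle_{\mathcal{V}} = 0$ holds for all $X \in M$. Assume that there 
exists an $X' \in M$ such that $\mu'(X') \ne \mu^*(X')$; then pick $h := \mu'(X') - \mu^*(X')$ and we have $0 = \langle h, \mu'(X') - \mu^*(X') \rangle_{\mathcal{V}} = \|\mu'(X') - \mu^*(X')\|_{\mathcal{V}}^2 > 0$ as $\|\cdot\|_{\mathcal{V}}$ is a norm. Hence, $\mu'(X) = \mu^*(X)$ a.s. □ 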
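
The first part of the argument can be checked numerically on a finite toy problem. The sketch below is an illustration under assumptions: $X$ and $Y$ take finitely many values, a random vector per outcome $y$ stands in for $L(y, \cdot)$, and all variable names are hypothetical. It confirms that the excess surrogate risk of any $\mu$ over the conditional mean equals $\mathbb{E}_X \|\mu(X) - \mu^*(X)\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, d = 3, 4, 5

joint = rng.random((n_x, n_y))
joint /= joint.sum()                      # joint distribution P(x, y)
p_x = joint.sum(axis=1)                   # marginal P_X
cond = joint / p_x[:, None]               # conditional P(y | x)

feat = rng.normal(size=(n_y, d))          # row y stands in for L(y, .)
mu_star = cond @ feat                     # conditional mean E[L(Y, .) | X = x]

def surrogate_risk(mu):
    """E_{X,Y} || L(Y, .) - mu(X) ||^2 computed exactly over the finite support."""
    diff = feat[None, :, :] - mu[:, None, :]          # shape (n_x, n_y, d)
    return float(np.sum(joint * np.sum(diff ** 2, axis=-1)))

mu = mu_star + rng.normal(size=mu_star.shape)         # an arbitrary competitor
excess = surrogate_risk(mu) - surrogate_risk(mu_star)
gap = float(np.sum(p_x * np.sum((mu - mu_star) ** 2, axis=1)))
```

Both `excess` and `gap` are non-negative and agree up to rounding, which is the identity used in the proof.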



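Theorem A.3. If there exists $\eta > 0$ and $\mu^* \in \mathcal{H}_\Gamma$ such that $\sup_{\mu \in B(\mathcal{H}_\Gamma)} \mathbb{E}_X[\mathbb{E}[\mu(X)|X] - \langle \mu(X), \mu^*(X) \rangle_{\mathcal{V}}]^2 = \eta < \infty$, 
that $\mu'$ is a minimiser of $\mathcal{E}_s$ and $\tilde{\mu}$ is an element of $\mathcal{H}_\Gamma$ with $\mathcal{E}_s[\tilde{\mu}] \le \mathcal{E}_s[\mu'] + \delta$ then 

(i) $\mathcal{E}[\mu'] \le \left( \sqrt{\mathcal{E}[\mu^*]} + \eta^{1/4} \sqrt{8} (\|\mu^*\|_\Gamma + \|\mu'\|_\Gamma)^{1/2} \right)^2$, 

(ii) $\mathcal{E}[\tilde{\mu}] \le \left( \sqrt{\mathcal{E}[\mu^*]} + \eta^{1/4} \sqrt{8} (\|\mu^*\|_\Gamma + \|\tilde{\mu}\|_\Gamma)^{1/2} + \delta^{1/2} \right)^2$. 

Proof. First, observe that if $\mu \in \mathcal{H}_\Gamma$ then we have due to the Jensen inequality 

$$|\mathbb{E}_X \mathbb{E}[\mu(X)|X] - \mathbb{E}_X \langle \mu(X), \mu^*(X) \rangle_{\mathcal{V}}| \le \|\mu\|_\Gamma \, \mathbb{E}_X \left| \mathbb{E}\left[ \frac{\mu(X)}{\|\mu\|_\Gamma} \Big| X \right] - \left\langle \frac{\mu(X)}{\|\mu\|_\Gamma}, \mu^*(X) \right\rangle_{\mathcal{V}} \right|$$
$$\le \|\mu\|_\Gamma \sqrt{ \mathbb{E}_X \left( \mathbb{E}\left[ \frac{\mu(X)}{\|\mu\|_\Gamma} \Big| X \right] - \left\langle \frac{\mu(X)}{\|\mu\|_\Gamma}, \mu^*(X) \right\rangle_{\mathcal{V}} \right)^2 } \le \|\mu\|_\Gamma \sqrt{\eta}.$$

We can now reproduce the proof of Lemma A.1 with an approximation error. For any $\mu \in \mathcal{H}_\Gamma$ we have 

$$|\mathbb{E}_X \langle \mu(X), \mu^*(X) \rangle_{\mathcal{V}} - \mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu(X) \rangle_{\mathcal{V}}| = |\mathbb{E}_X \langle \mu(X), \mu^*(X) \rangle_{\mathcal{V}} - \mathbb{E}_X \mathbb{E}[\mu(X)|X]| \le \|\mu\|_\Gamma \sqrt{\eta}.$$

In particular, 

$$|\mathbb{E}_{X,Y} \langle L(Y, \cdot), \mu^*(X) \rangle_{\mathcal{V}} - \mathbb{E}_X \|\mu^*(X)\|_{\mathcal{V}}^2| \le \|\mu^*\|_\Gamma \sqrt{\eta}.$$

Like in the proof of Thm. 3.1 we have for any $\mu$ that 

$$\mathcal{E}_s[\mu] - \mathcal{E}_s[\mu^*] = \mathbb{E}_{X,Y}[\|L(Y, \cdot) - \mu(X)\|_{\mathcal{V}}^2] - \mathbb{E}_{X,Y}[\|L(Y, \cdot) - \mu^*(X)\|_{\mathcal{V}}^2]$$
$$\ge \mathbb{E}_X \|\mu(X)\|_{\mathcal{V}}^2 - 2\mathbb{E}_X \langle \mu^*(X), \mu(X) \rangle_{\mathcal{V}} + \mathbb{E}_X \|\mu^*(X)\|_{\mathcal{V}}^2 - 2\|\mu^*\|_\Gamma \sqrt{\eta} - 2\|\mu\|_\Gamma \sqrt{\eta}$$
$$= \mathbb{E}_X[\|\mu^*(X) - \mu(X)\|_{\mathcal{V}}^2] - 2\sqrt{\eta}(\|\mu^*\|_\Gamma + \|\mu\|_\Gamma). \qquad (18)$$

In particular, $|\mathcal{E}_s[\mu] - \mathcal{E}_s[\mu^*]| \ge \mathcal{E}_s[\mu] - \mathcal{E}_s[\mu^*] \ge \mathbb{E}_X[\|\mu^*(X) - \mu(X)\|_{\mathcal{V}}^2] - 2\sqrt{\eta}(\|\mu^*\|_\Gamma + \|\mu\|_\Gamma)$ and hence 

$$\mathbb{E}_X[\|\mu^*(X) - \mu(X)\|_{\mathcal{V}}^2] \le |\mathcal{E}_s[\mu] - \mathcal{E}_s[\mu^*]| + 2\sqrt{\eta}(\|\mu^*\|_\Gamma + \|\mu\|_\Gamma). \qquad (19)$$

We can now bound the error $\mathcal{E}[\mu]$ in dependence of how similar $\mu$ is to $\mu^*$ in the surrogate cost function $\mathcal{E}_s$: 

$$\sqrt{\mathcal{E}[\mu]} \le \sqrt{\mathcal{E}[\mu^*]} + \sup_{h \in B(\mathcal{V})} \sqrt{\mathbb{E}_X[\langle h, \mu^*(X) - \mu(X) \rangle_{\mathcal{V}}]^2} \le \sqrt{\mathcal{E}[\mu^*]} + \sqrt{ \mathbb{E}_X \left\langle \frac{\mu^*(X) - \mu(X)}{\|\mu^*(X) - \mu(X)\|_{\mathcal{V}}}, \mu^*(X) - \mu(X) \right\rangle_{\mathcal{V}}^2 }$$
$$\le \sqrt{\mathcal{E}[\mu^*]} + \eta^{1/4} \sqrt{2} (\|\mu^*\|_\Gamma + \|\mu\|_\Gamma)^{1/2} + \sqrt{|\mathcal{E}_s[\mu] - \mathcal{E}_s[\mu^*]|}, \qquad (20)$$

where we used that $\langle h, \mu^*(X) - \mu(X) \rangle_{\mathcal{V}} \le \left\langle \frac{\mu^*(X) - \mu(X)}{\|\mu^*(X) - \mu(X)\|_{\mathcal{V}}}, \mu^*(X) - \mu(X) \right\rangle_{\mathcal{V}}$ for any $h \in B(\mathcal{V})$ (observe 
that $h$ is independent of $X$) and eq. (19). 

Now, for $\mu := \mu'$ observe that $\mathcal{E}_s[\mu'] + 2\sqrt{\eta}(\|\mu^*\|_\Gamma + \|\mu'\|_\Gamma) \ge \mathcal{E}_s[\mu^*]$ follows from eq. (18), and as $\mu'$ is an $\mathcal{E}_s$ minimiser 
we have $|\mathcal{E}_s[\mu^*] - \mathcal{E}_s[\mu']| \le 2\sqrt{\eta}(\|\mu^*\|_\Gamma + \|\mu'\|_\Gamma)$, and from eq. (20) we get 

$$\sqrt{\mathcal{E}[\mu']} \le \sqrt{\mathcal{E}[\mu^*]} + \eta^{1/4} \sqrt{8} (\|\mu^*\|_\Gamma + \|\mu'\|_\Gamma)^{1/2}.$$

Similarly, for $\mu := \tilde{\mu}$ we have 

$$\sqrt{\mathcal{E}[\tilde{\mu}]} \le \sqrt{\mathcal{E}[\mu^*]} + \eta^{1/4} \sqrt{8} (\|\mu^*\|_\Gamma + \|\tilde{\mu}\|_\Gamma)^{1/2} + \delta^{1/2}.$$

□ 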



B. Assumptions 

B.1. Simple assumptions to make the convergence theorems hold. 

In this section, we present general assumptions that make the assumptions in (Caponnetto & De Vito, 2007, Theorems 
1 and 2) hold. Weaker assumptions are certainly possible and we do not claim that our set of assumptions is the most 
general possible. 

B.1.1. Assumptions for the involved spaces and kernels 

We assume that $\mathcal{X}$ is a compact subset of $\mathbb{R}^n$, $\mathcal{Y}$ is a compact set and $\mathcal{V}$ is a finite dimensional RKHS. Furthermore, 
we assume that we work on measure spaces $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, $(\mathcal{X}, \mathfrak{B}(\mathcal{X}))$, $(\mathcal{Y}, \mathfrak{B}(\mathcal{Y}))$, $(\mathcal{V}, \mathfrak{B}(\mathcal{V}))$, where $\mathfrak{B}(\mathbb{R})$ etc. are the 
respective Borel algebras and each of the spaces is equipped with its norm induced topology. Finally, we assume that $\Gamma$ 
and $L$ are continuous. Hence, $L(y, y) \le B < \infty$ for all $y \in \mathcal{Y}$. 
B.1.2. Data generating distribution 

The main object in this study is the probability measure $P$ on the space $\mathcal{X} \times \mathcal{Y}$ from which the data is generated. 
Formally, we have a measure space$^2$ $(\mathcal{X} \times \mathcal{Y}, \Sigma, P)$ where $\Sigma$ is a suitable $\sigma$-algebra and $P : \Sigma \to [0, 1]$ a probability 
measure. Now we have a transformation $\phi : y \mapsto L(y, \cdot)$ which turns a sample $\{(x_1, y_1), (x_2, y_2), \ldots\}$ into 
$\{(x_1, L(y_1, \cdot)), (x_2, L(y_2, \cdot)), \ldots\}$ with $(x_i, L(y_i, \cdot)) \in \mathcal{X} \times \mathcal{V}$. The measure $P$ and the $\sigma$-algebra $\Sigma$ induce through the 
image measure construction a measure space $(\mathcal{X} \times \mathcal{V}, \tilde{\Sigma}, \tilde{P})$, where $\tilde{\Sigma} := \{E \mid \phi^{-1}[E] \in \Sigma, E \subseteq \mathcal{X} \times \mathcal{V}\}$ is the largest 
$\sigma$-algebra on $\mathcal{X} \times \mathcal{V}$ such that $\phi$ is measurable and $\tilde{P}E = P(\phi^{-1}[E])$ for all $E \in \tilde{\Sigma}$ (Fremlin, 2000)[112E]. 

We assumed already that $\mathcal{X}$, $\mathcal{V}$ carry the Borel algebra. In addition assume that $\Sigma = \mathfrak{B}(\mathcal{X} \times \mathcal{Y})$ and $\tilde{\Sigma} = \mathfrak{B}(\mathcal{X} \times \mathcal{V})$, 
i.e. the Borel algebras on the product spaces. The assumptions we have stated so far guarantee that the regular conditional 
probabilities $\tilde{P}(B|x)$ and $P(B|x)$ exist, where $B \in \mathfrak{B}(\mathcal{V})$ and $B \in \mathfrak{B}(\mathcal{Y})$, respectively (Steinwart & Christmann, 2008)[Lem. A.3.16]. 

Our assumptions on $L$ and $\mathcal{V}$ guarantee that all $h$ in $\mathcal{V}$ and $\mu \in \mathcal{H}_\Gamma$ are continuous (Berlinet & Thomas-Agnan, 2004)[Th. 
17] and Cor. 2.3. Hence, all $h$ and $\mu$ are measurable as by assumption all spaces are equipped with Borel algebras (Fremlin, 
2003)[4A3Cb,d]. Furthermore, as each $|h|$ and $\|\mu\|_{\mathcal{V}}$ is a continuous function on a compact set it is upper bounded by a 
constant $B$. Thus all $h$ and $\mu$ are integrable with respect to any probability measure on $(\mathcal{Y}, \mathfrak{B}(\mathcal{Y}))$ and $(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$. 

The final assumption is that one of the two equivalent conditions of Theorem B.3 is fulfilled. 



B.2. Verification of the Caponnetto & De Vito assumptions 

Theorems 1 and 2 from (Caponnetto & De Vito, 2007) are based on a set of assumptions. The assumptions can be grouped into 
three categories: first, assumptions about the spaces $\mathcal{X}$ and $\mathcal{V}$; second, assumptions about the space of regression functions 
that we use; and third, assumptions about the underlying true probability distribution. Table 1 summarizes the assumptions that need to 
be fulfilled in our setting. In the following we briefly discuss each assumption and show ways to verify it. Also 
note that the general assumptions from the last section guarantee that all assumptions in Table 1 are fulfilled. 

Assumptions for $\mathcal{X}$ and $\mathcal{V}$. $\mathcal{X}$ needs to be a Polish space, that is, a separable completely metrizable topological space. If 
$\mathcal{X}$ is, for example, a subset of $\mathbb{R}^n$ equipped with any metric then $\mathcal{X}$ is Polish if $\mathcal{X}$ is closed. 

$\mathcal{V}$ needs to be a separable Hilbert space to be able to apply the theorems. In our case $\mathcal{V}$ is an RKHS, so certainly a Hilbert 
space. Not all RKHSs are separable, but due to our assumption that $\mathcal{V}$ is finite dimensional it follows that it is separable 
(e.g. take a basis $e_1, \ldots, e_n$ of $\mathcal{V}$; then the countable set $\{\sum_{i \le n} q_i e_i \mid q_i \in \mathbb{Q}\}$ is dense in the RKHS norm). 

Finally, $\mathrm{Tr}(\Gamma(x, x)) \le B$ needs to hold for all $x \in \mathcal{X}$, i.e. the trace needs to be bounded over $\mathcal{X}$. 



Assumptions for $\mathcal{H}_\Gamma$. $\mathcal{H}_\Gamma$ needs to be separable. Separability of $\mathcal{H}_\Gamma$ and continuity of the kernel are closely related in 
the scalar case (Berlinet & Thomas-Agnan, 2004)[Sec. 1.5]. Similarly, we have: 

Theorem B.1. $\mathcal{H}_\Gamma$ is separable if $\mathcal{X}$ is a Polish space, $\mathcal{V}$ is a finite dimensional Hilbert space and $\Gamma$ is continuous in each 
argument with respect to the topology on $\mathcal{X}$. 

Corollary B.2. If $\mathcal{X} = \mathbb{R}^n$, $\mathcal{V}$ is a finite dimensional RKHS and $\Gamma$ is a continuous kernel then $\mathcal{H}_\Gamma$ is separable. 

Furthermore, the point evaluator $\Gamma_x^*(f) = f(x)$, $f \in \mathcal{H}_\Gamma$, $\Gamma_x^* : \mathcal{H}_\Gamma \to \mathcal{V}$ needs to be Hilbert-Schmidt for every $x$. This is 
in our case always fulfilled as $\mathcal{V}$ is finite dimensional: let $\{e_i\}_{i \le n}$ be an orthonormal basis of $\mathcal{V}$; then 

$$\|\Gamma_x^*\|_{HS}^2 = \|\Gamma_x\|_{HS}^2 = \sum_{i \le n} \|\Gamma_x e_i\|_\Gamma^2 = \sum_{i \le n} \langle \Gamma_x e_i, \Gamma_x e_i \rangle_\Gamma = \sum_{i \le n} \langle e_i, \Gamma(x, x) e_i \rangle_{\mathcal{V}}.$$

But $\Gamma(x, x)$ is a finite dimensional matrix as $\mathcal{V}$ is finite dimensional, and hence the sum is finite and $\Gamma_x^*$ has finite 
Hilbert-Schmidt norm. 

The next condition concerns the measurability of the kernel: for all $f, g \in \mathcal{V}$, the map $(x, x') \mapsto \langle f, \Gamma(x', x) g \rangle_{\mathcal{V}}$ needs to be 
measurable. The simplest case where this holds is where $\Gamma$ is continuous as a map from $(y, x)$ to $\mathcal{L}(\mathcal{V})$ and where we equip 
$\mathbb{R}$, $\mathcal{X}$ and $\mathcal{V}$ with the Borel algebra. Then we have measurability, as compositions of continuous functions are continuous 
and continuous functions are Borel measurable. 



2 Some measure theory is needed to write things down cleanly. See also (Steinwart & Christmann, 2008)[chp. 2] where similar 
problems are discussed. 
B.3. Assumptions for the "true" probability measure 

There are further assumptions concerning the measures $P$ and $\tilde{P}$ from which the data is generated that need to be fulfilled 
to be able to apply the convergence theorems. We now follow up on the discussion of these measures in Appendix B.1.2. 
Again our general assumptions are sufficient to guarantee that the assumptions are fulfilled. 

The first property that $\tilde{P}$ needs to fulfill is that $\int_{\mathcal{X} \times \mathcal{V}} \|h\|_{\mathcal{V}}^2 \, d\tilde{P}(x, h) < \infty$ where the integral is the Lebesgue integral. This 
is essentially trivially fulfilled for all $\tilde{P}$ in our setting as $\tilde{P}$ is concentrated on elements $L(y, \cdot)$, i.e. $\mathcal{L} = \{L(y, \cdot) \mid y \in \mathcal{Y}\} \in \tilde{\Sigma}$ 
as $\phi^{-1}[\mathcal{L}] = \mathcal{X} \times \mathcal{Y} \in \Sigma$ and $\tilde{P}\mathcal{L} = P(\phi^{-1}[\mathcal{L}]) = P(\mathcal{X} \times \mathcal{Y}) = 1$. Furthermore, $\|L(y, \cdot)\|_{\mathcal{V}}^2 = L(y, y) \le B < \infty$ 
if we have a bounded kernel $L$. The only problem is that we need to guarantee that $\|h\|_{\mathcal{V}}^2$ is integrable. Let us therefore 
assume in the following that $\|h\|_{\mathcal{V}}^2 \upharpoonright \mathcal{L}$ (i.e. the restriction to $\mathcal{L}$) is measurable. We have $\|h\|_{\mathcal{V}}^2 \upharpoonright \mathcal{L} \le B\chi_{\mathcal{L}}$ where $\chi$ is 
the characteristic function. $B\chi_{\mathcal{L}}$ is measurable as $\mathcal{L}$ is measurable. Furthermore, it is integrable as it is a simple function 
and $\tilde{P}\mathcal{L} = 1 < \infty$. Now, $\|h\|_{\mathcal{V}}^2 \upharpoonright \mathcal{L}$ is defined on the conegligible set $\mathcal{L}$ (i.e. $\tilde{P}\mathcal{L} = 1$) and by assumption $\|h\|_{\mathcal{V}}^2 \upharpoonright \mathcal{L}$ is 
measurable. Hence, $\|h\|_{\mathcal{V}}^2 \upharpoonright \mathcal{L}$ is integrable (Fremlin, 2000)[Lem. 122Jb], and as $\|h\|_{\mathcal{V}}^2 \upharpoonright \mathcal{L} =_{\tilde{P}\text{-a.e.}} \|h\|_{\mathcal{V}}^2$ we have that 
$\|h\|_{\mathcal{V}}^2$ is integrable (Fremlin, 2000)[122Ld] with integral $\int_{\mathcal{X} \times \mathcal{V}} \|h\|_{\mathcal{V}}^2 \, d\tilde{P}(x, h) \le B < \infty$. 

The next assumption concerns the infimum $\mathcal{E}[\mu^*] = \inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}[\mu]$, i.e. the infimum must be attained at a $\mu^* \in \mathcal{H}_\Gamma$. One 
can restate this condition in terms of the RKHS norm of a sequence converging to the infimum: the infimum is attained iff 
such a sequence exists which is bounded in the RKHS norm. The intuition here is that the condition is not fulfilled if we 
need to make $\mu_n$ more and more complex (in the sense of a high RKHS norm) to optimize the risk. 

Theorem B.3. Let $\|h\|_{\mathcal{V}}$, $\|\mu(x) - h\|_{\mathcal{V}}$ be integrable for all $\mu \in \mathcal{H}_\Gamma$ and let $\mathcal{V}$ be a finite dimensional Hilbert space. 
Then there exists a $\mu^* \in \mathcal{H}_\Gamma$ with $\mathcal{E}[\mu^*] = \inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}[\mu]$ iff a $B > 0$ exists and a sequence $\{\mu_n\}_{n \in \mathbb{N}}$ with $\mathcal{E}[\mu_n] \le 
\inf_{\mu \in \mathcal{H}_\Gamma} \mathcal{E}[\mu] + 1/n$ and $\|\mu_n\|_\Gamma < B$. 

Remark. The important assumption is that the sequence $\{\mu_n\}$ is bounded in the RKHS norm, as we can by definition 
always find a sequence $\{\mu_n\}$ that converges to the infimum. 

Proof. "$\Rightarrow$": set $\mu_n := \mu^*$; then obviously the sequence converges to the infimum and is bounded as $\|\mu_n\|_\Gamma = \|\mu^*\|_\Gamma < \infty$ 
as $\mu^* \in \mathcal{H}_\Gamma$. 

"$\Leftarrow$": We need to pull the limit through the integral and we need to make sure that the limit is attained by an RKHS 
function. First, let us check for the limit to be in the RKHS. $\{\mu_n\}_{n \in \mathbb{N}}$ is a bounded sequence in a reflexive space (every 
Hilbert space is reflexive). Hence, there is a subsequence $\{\mu_{n_k}\}_{k \in \mathbb{N}}$ which converges weakly (Werner, 2002)[Th. III.3.7]. 
Furthermore, $\mathcal{H}_\Gamma$ is closed and convex. Therefore, the weak limit $\mu$ is in $\mathcal{H}_\Gamma$ (Werner, 2002)[Thm. III.3.8]. 

Now we turn to the integral. We want to apply the Lebesgue dominated convergence theorem to move the limit inside the 
integral. We need to construct a suitable sequence: consider the sequence $s_k(x, h) := \{\|g_k(x) - h\|_{\mathcal{V}}\}_{k \in \mathbb{N}}$ in $\mathbb{R}$, where 
$g_k := \mu_{n_k}$ and $\{\mu_{n_k}\}_{k \in \mathbb{N}}$ is the convergent subsequence. The sequence $\{s_k(x, h)\}_{k \in \mathbb{N}}$ is a Cauchy sequence in $\mathbb{R}$ as for 
any $\epsilon > 0$ we have 

$$\big| \|g_k(x) - h\|_{\mathcal{V}} - \|g_l(x) - h\|_{\mathcal{V}} \big| \le \|g_k(x) - g_l(x)\|_{\mathcal{V}} = \sum_{i \le u} \langle g_k(x) - g_l(x), e_i \rangle_{\mathcal{V}} = \sum_{i \le u} \langle g_k - g_l, \Gamma_x e_i \rangle_\Gamma,$$

where we used the Parseval equality and set $u$ as the dimension of $\mathcal{V}$. Now, as $g_k$ is weakly convergent we can find for 
each $i$ an $N(i)$ such that $\langle g_k - g_l, \Gamma_x e_i \rangle_\Gamma < \epsilon/u$. Let $N = \max\{N(1), \ldots, N(u)\}$; then we have 

$$\big| \|g_k(x) - h\|_{\mathcal{V}} - \|g_l(x) - h\|_{\mathcal{V}} \big| < \epsilon.$$

As $\mathbb{R}$ is complete the sequence $\{s_k(x, h)\}_{k \in \mathbb{N}}$ attains its limit for every pair $(x, h) \in \bigcup_{l \in \mathbb{N}} \bigcap_{k \ge l} \mathrm{dom}(\|g_k(x) - h\|_{\mathcal{V}})$, i.e. 
whenever the sequence is defined for a pair $(x, h)$. In particular, as $\|g_k(x) - h\|_{\mathcal{V}}$ is integrable there exists for every $k$ a 
measurable set $A_k \subseteq \mathrm{dom}(\|g_k(x) - h\|_{\mathcal{V}})$ with $\tilde{P}[A_k] = 1$, $s_k$ attains its limit on the set $D := \bigcup_{l \in \mathbb{N}} \bigcap_{k \ge l} A_k \in \tilde{\Sigma}$ and 
$\tilde{P}[D] = 1$ (continuity of measure). Finally, $\|g_k(x) - h\|_{\mathcal{V}} \le B\chi_D + \|h\|_{\mathcal{V}} \le (B + 1)\chi_D + \|h\|_{\mathcal{V}}^2 =: f(x, h)$. $f(x, h)$ is 
integrable as a sum of integrable functions. Hence, we can apply the Lebesgue dominated convergence theorem (Fremlin, 
2000)[Th. 123C] and we get that $\lim_{k \to \infty} \|g_k(x) - h\|_{\mathcal{V}}$ is integrable and 

$$\lim_{k \to \infty} \mathcal{E}[g_k] = \int_D \lim_{k \to \infty} \|g_k(x) - h\|_{\mathcal{V}} \, d\tilde{P}(x, h).$$

Furthermore, $\lim_{k \to \infty} \|g_k(x) - h\|_{\mathcal{V}} = \|\mu(x) - h\|_{\mathcal{V}}$ on $D$ as for any $\epsilon > 0$ there exists an $N$ such that for all $k > N$ 

$$\big| \|g_k(x) - h\|_{\mathcal{V}} - \|\mu(x) - h\|_{\mathcal{V}} \big| \le \|g_k(x) - \mu(x)\|_{\mathcal{V}} \le \sum_{i \le u} \langle g_k(x) - \mu(x), e_i \rangle_{\mathcal{V}} = \sum_{i \le u} \langle g_k - \mu, \Gamma_x e_i \rangle_\Gamma < \epsilon$$

as $g_k$ converges weakly to $\mu$ and $u < \infty$. Hence, $\lim_{k \to \infty} \mathcal{E}[g_k] = \mathcal{E}[\mu]$ and $\mu \in \mathcal{H}_\Gamma$. Also, 

$$\inf_{g \in \mathcal{H}_\Gamma} \mathcal{E}[g] \le \mathcal{E}[\mu] \le \inf_{g \in \mathcal{H}_\Gamma} \mathcal{E}[g] + 1/n$$

for all $n \in \mathbb{N}$. Therefore, $\inf_{g \in \mathcal{H}_\Gamma} \mathcal{E}[g] = \mathcal{E}[\mu]$. Hence, $\mu^* := \mu$ is a minimiser. □ 



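Furthermore, the probability measure needs to factor as $\tilde{P}(h, x) = P_X(x)\tilde{P}(h|x)$ and there need to be constants $c$, $d$ such 
that 

$$\int_{\mathcal{V}} \left( \exp\left( \frac{\|h - \mu^*(x)\|_{\mathcal{V}}}{d} \right) - \frac{\|h - \mu^*(x)\|_{\mathcal{V}}}{d} - 1 \right) d\tilde{P}(h|x) \le \frac{c^2}{2d^2}.$$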



In particular, exp ^^ t^js^ht ^ _ W h i- L ( x )\\v _ ^ nee( j s j- be _P(-|x) integrable. In case that P(h\x) is concentrated 
on C := {L{x, -)\x E X} we have exp( l|, '~ A '' ( " )llv ) - M^ipk _ i < ^^ MwMWk j on £ ; where we used 

Corollary 4 and restricted $d > 0$. As $\mu^*$ has finite RKHS norm and as we have a probability measure the assumption is 
always fulfilled in our case. 



B.4. Discussion of assumptions from (Song et al., 2009) 

We review conditions from (Song et al., 2009; Fukumizu et al., 2011) that establish the existence of $\mu \in \mathcal{H}_\Gamma$ such that 
$\mathbb{E}(h(Y)|X = x) = \langle h, \mu(x) \rangle_{\mathcal{H}_L}$, $\forall h \in \mathcal{H}_L$, as well as convergence results for an empirical estimate of $\mu$ (note that certain 
additional conditions over those stated in (Fukumizu et al., 2011) are needed for the results to be rigorous, following the 
discussion in (Fukumizu et al., 2011, Appendix, Theorem 6.6): we include these here). To describe these conditions, some 
preliminary definitions will be needed. Define the uncentred covariance operators, 

$$C_{XX} = \mathbb{E}_X K(X, \cdot) \otimes K(X, \cdot), \qquad C_{XY} = \mathbb{E}_{XY} K(X, \cdot) \otimes L(Y, \cdot),$$

where the tensor product satisfies $(a \otimes b)c = \langle b, c \rangle_{\mathcal{H}_L} a$ for $a \in \mathcal{H}_K$ and $b, c \in \mathcal{H}_L$ (an analogous result applies for 
$a, b, c \in \mathcal{H}_K$). As the mappings in Song et al. are defined from $\mathcal{H}_K$ to $\mathcal{H}_L$, rather than $\mathcal{X}$ to $\mathcal{H}_L$, we next establish the 
relation between the Song et al. notation and ours: for every $h \in \mathcal{H}_L$, 

$$\langle h, \mu(x) \rangle_{\mathcal{H}_L} = \langle h, U_{Y|X} \phi(x) \rangle_{\mathcal{H}_L},$$

where the notation on the left is from the present paper, that on the right is from Song et al., and $U_{Y|X} : \mathcal{H}_K \to \mathcal{H}_L$ (we 
therefore emphasize $U_{Y|X} \notin \mathcal{H}_\Gamma$, since its input space is $\mathcal{H}_K$; however, $U_{Y|X} \circ \phi(x) \in \mathcal{H}_\Gamma$ under the conditions stated 
below). We have the following theorem, corresponding to (Song et al., 2009, Theorem 1). 

Theorem B.4 ((Song et al., 2009)). Assume that for all $h \in \mathcal{H}_L$, we have $\mathbb{E}(h(Y)|X = \cdot) \in \mathcal{H}_K$, $C_{XX}$ is injective, and 
the operator $Sh := \mathbb{E}(h(Y)|X = \cdot)$ is bounded for all $h \in \mathcal{H}_L$. Song et al. make the further smoothness assumption that 
$C_{XX}^{-3/2} C_{XY}$ is Hilbert-Schmidt (in fact, following (Fukumizu et al., 2011, Appendix), boundedness should be sufficient, 
although this is outside the scope of the present paper). Then the definition $U_{Y|X} := C_{YX} C_{XX}^{-1}$ satisfies both $U_{Y|X} \circ 
\phi(x) \in \mathcal{H}_\Gamma$ and $\mu(x) = U_{Y|X} \phi(x)$, and an empirical estimate $\hat{U}_{Y|X} := \hat{C}_{YX} (\hat{C}_{XX} + \delta_n I)^{-1}$ converges with rate 

$$\left\| \hat{C}_{YX} (\hat{C}_{XX} + \delta_n I)^{-1} \phi(x) - C_{YX} C_{XX}^{-1} \phi(x) \right\|_{\mathcal{H}_L} = O_p(n^{-1/4})$$

when $\delta_n = n^{-1/4}$. 
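
To make the empirical estimate concrete, the following sketch evaluates it with explicit finite-dimensional features and checks that it agrees with the familiar Gram-matrix form. It rests on several assumptions: random vectors stand in for the feature maps of $K$ and $L$, the regulariser is taken as $\delta_n = n^{-1/4}$, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 6, 4
Phi = rng.normal(size=(n, p))     # row i stands in for the feature map of x_i
Psi = rng.normal(size=(n, q))     # row i stands in for L(y_i, .)
delta = n ** (-0.25)              # regulariser, assumed to be n^(-1/4)

# Empirical uncentred covariance operators.
C_xx = Phi.T @ Phi / n            # (p, p)
C_yx = Psi.T @ Phi / n            # (q, p)

phi_x = rng.normal(size=p)        # feature vector of a test point x

# Operator form: C_yx (C_xx + delta I)^{-1} phi(x).
mu_operator = C_yx @ np.linalg.solve(C_xx + delta * np.eye(p), phi_x)

# Equivalent Gram-matrix form: Psi^T (K + n delta I)^{-1} k(x).
K = Phi @ Phi.T                   # Gram matrix on the inputs
k_x = Phi @ phi_x                 # kernel evaluations between x and the sample
mu_kernel = Psi.T @ np.linalg.solve(K + n * delta * np.eye(n), k_x)
```

The two vectors coincide up to rounding, which is the usual push-through identity for regularised inverses.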
C. Proofs 

Convergence result from (Song et al., 2009). The theorem in (Song et al., 2009) states the following: let $P$ be a 
probability measure and let $\hat{U}$ be the estimate of $U$ defined in (Song et al., 2009); then there exists a constant $C$ such that 
$\lim_{n \to \infty} P^n[\|\hat{U} - U\|_{HS} > C n^{-1/8}] = 0$, where HS denotes the Hilbert-Schmidt norm. 

Under the assumption that $K(x, x) \le B$ for all $x \in \mathcal{X}$ this implies: for a probability measure $P$ and an estimate $\hat{U}$ of $U$ it 
holds that there exists a constant $\tau$ such that $\lim_{n \to \infty} P^n[\sup_{h \in B(\mathcal{V})} \mathbb{E}_X[\mathbb{E}[h|X] - \langle h, \hat{\mu}(X) \rangle_{\mathcal{V}}]^2 > \tau n^{-1/4}] = 0$. 

Proof. Let $|||\cdot|||$ be the operator norm. We have that $|||A||| \le \|A\|_{HS}$ for a Hilbert-Schmidt operator (Werner, 
2002)[p. 268, Satz VI.6.2c]. Using this fact and Cauchy-Schwarz we get: 

$$\sup_{h \in B(\mathcal{V})} \mathbb{E}_X[\mathbb{E}[h|X] - \langle h, \hat{U} K(X, \cdot) \rangle_{\mathcal{V}}]^2 \le \mathbb{E}_X \left\| (U - \hat{U}) \frac{K(X, \cdot)}{\|K(X, \cdot)\|_K} \right\|_{\mathcal{V}}^2 \|K(X, \cdot)\|_K^2 \le |||U - \hat{U}|||^2 \, \mathbb{E}_X K(X, X) \le B \|U - \hat{U}\|_{HS}^2$$

and the statement follows with $\tau := C^2 B$. □ 
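
The inequality between the operator norm and the Hilbert-Schmidt norm used in this proof is easy to check numerically in finite dimensions, where the Hilbert-Schmidt norm is the Frobenius norm. The sketch below is only an illustration; the random matrix stands in for $U - \hat{U}$ and the vectors for $h$ and $K(X, \cdot)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))               # stands in for the operator U - U_hat

op_norm = np.linalg.norm(A, 2)            # operator norm: largest singular value
hs_norm = np.linalg.norm(A, "fro")        # Hilbert-Schmidt (Frobenius) norm

k = rng.normal(size=4)                    # stands in for K(X, .)
h = rng.normal(size=6)
h /= np.linalg.norm(h)                    # unit-norm test function

lhs = float(h @ A @ k) ** 2               # squared inner product <h, A k>^2
mid = op_norm ** 2 * float(k @ k)         # bound via the operator norm
rhs = hs_norm ** 2 * float(k @ k)         # looser bound via the HS norm
```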



Example. The space $\mathcal{X}$ together with the reproducing kernel of $\mathcal{H}_K$ can lead to problems for the assumption from (Song 
et al., 2009): 

Corollary C.1. Let $\mathcal{V}$ be finite dimensional such that a function $h \in \mathcal{V}$ exists with $h(y) \ge \epsilon > 0$ for all $y \in \mathcal{Y}$. 
Furthermore, let $\mathcal{X} := [-1, 1]$ and the reproducing kernel for $\mathcal{H}_K$ be $K(x, y) = xy$. Then there exists no measure which 
fulfills the assumption from eq. (13). 

Proof. Assume there is a measure and a measure space $(\mathcal{Y}, \Sigma, \mu)$. We can assume that all $h \in \mathcal{V}$ are integrable as 
otherwise the assumption fails. We have $\mathbb{E}[h|X = x] \ge \mathbb{E}[\epsilon\chi_{\mathcal{Y}}|X = x] = \epsilon > 0$. However, by assumption $\mathbb{E}[h|X = x] = 
\sum_{i \le m} \langle h, h_i \rangle_{\mathcal{V}} \, \mathbb{E}[h_i|X = x] = \sum_{i \le m} \langle h, h_i \rangle_{\mathcal{V}} f_i(x)$ for some $f_i \in \mathcal{H}_K$, where $m$ is the dimension of $\mathcal{V}$ and $\{h_i\}_{i \le m}$ is an 
orthonormal basis. $\mathcal{H}_K = \{z \cdot \mid z \in \mathbb{R}\}$ and hence $\mathbb{E}[h(Y)|X = x] = \langle h, \sum_{i \le m} z_i x h_i \rangle_{\mathcal{V}} = x \langle h, \sum_{i \le m} z_i h_i \rangle_{\mathcal{V}}$. If the 
inner product is positive then we get $\mathbb{E}[h(Y)|X = x] \le 0$ for $x \in [-1, 0]$ and a contradiction. Similarly for the case where 
the inner product is negative. □ 
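
The obstruction in Corollary C.1 can be seen directly: with the kernel $K(x, y) = xy$ every element of $\mathcal{H}_K$ is a linear function $x \mapsto zx$, which cannot stay above a positive constant on $[-1, 1]$. The following sketch checks this over a grid of slopes; the grid, the value of $\epsilon$ and the function name are illustrative assumptions.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 201)          # grid on X = [-1, 1]
eps = 0.1                                 # hypothetical positive lower bound

def stays_above_eps(z):
    """True if the linear function x -> z * x is at least eps on the whole grid."""
    return bool(np.min(z * xs) >= eps)

slopes = np.linspace(-50.0, 50.0, 1001)
any_feasible = any(stays_above_eps(z) for z in slopes)
# any_feasible is False: each function z * x is non-positive somewhere on the
# grid, so no element of H_K can represent a conditional expectation that is
# bounded below by eps on all of X.
```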

D. Justification of sparsity conditional mean embedding 

In this section, we provide a justification for the objective function (16) which we used in Section 5 to derive a sparse 
conditional mean embedding. Specifically, we show that the objective function (16) provides a natural upper bound to (5), 
which is the error measure we used to derive the embedding itself. 

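For simplicity, we assume that the underlying conditional mean embedding belongs to a reproducing kernel Hilbert space 
of $\mathcal{H}_L$-valued functions, $\mathcal{H}_\Gamma$, with operator-valued kernel $\Gamma \in \mathcal{L}(\mathcal{H}_L)$. That is, there exists $\mu^* \in \mathcal{H}_\Gamma$ such that, for every 
$h \in \mathcal{H}_L$ and $x \in \mathcal{X}$, it holds that $\mathbb{E}_Y[h(Y)|X = x] = \langle \mu^*(x), h \rangle_L$. Under this assumption we have that 

$$\mathbb{E}_Y[h(Y)|X] - \langle h, \mu(X) \rangle_L = \langle \mu^*(X) - \mu(X), h \rangle_L$$
$$= \langle \mu^* - \mu, \Gamma(X, \cdot) h \rangle_\Gamma$$
$$\le \|\mu^* - \mu\|_\Gamma \, \|\Gamma(X, \cdot) h\|_\Gamma$$
$$\le \|\mu^* - \mu\|_\Gamma \, |||\Gamma(X, X)|||^{1/2} \, \|h\|_L,$$

where the second equality follows from the reproducing property of the kernel $\Gamma$, the first inequality follows from the 
Cauchy-Schwarz inequality, the last inequality follows from (Micchelli & Pontil, 2005, Proposition 1(d)) and $|||\Gamma(x, x)|||$ denotes 
the operator norm. Consequently, 

$$\mathcal{E}[\mu] = \sup_{\|h\|_L \le 1} \mathbb{E}_X \left[ \left( \mathbb{E}_Y[h(Y)|X] - \langle h, \mu(X) \rangle_L \right)^2 \right] \le c \|\mu^* - \mu\|_\Gamma^2 \qquad (21)$$

where we have defined $c = \mathbb{E}_X |||\Gamma(X, X)|||$. 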
We choose $\mu$ to be the solution $\hat{\mu}$ of problem (9). As noted at the beginning of Section 5, in many practical circumstances, 
we wish to approximate $\hat{\mu}$ by a sparse version $\mu_{\mathrm{sparse}}$ which involves a smaller number of parameters. For this purpose, it 
is natural to use the error measure $\|\cdot\|_\Gamma$. Indeed, equation (21) and the triangle inequality yield that 

$$\mathcal{E}[\mu_{\mathrm{sparse}}] \le c \left( \|\mu^* - \hat{\mu}\|_\Gamma + \|\hat{\mu} - \mu_{\mathrm{sparse}}\|_\Gamma \right)^2.$$

Thus if $\hat{\mu}$ estimates well the true embedding $\mu^*$, so will the sparse mean embedding $\mu_{\mathrm{sparse}}$. 

In the specific case considered in Section 5, $\Gamma(x, x') = K(x, x')\,Id$, so that $|||\Gamma(x, x)||| = K(x, x)$ and $\|\cdot\|_\Gamma = \|\cdot\|_{K \otimes L}$. 
Furthermore, we choose $\hat{\mu} = \sum_{i,j=1}^{n} W_{ij} K_{x_i} L_{y_j}$ and $\mu_{\mathrm{sparse}} = \sum_{i,j=1}^{n} M_{ij} K_{x_i} L_{y_j}$, where the matrix $M$ is encouraged to 
be sparse. Different sparsity methods are discussed in Section 5, such as the Lasso method (cf. equation (15)). 
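
For embeddings of this form, the distance between the full and the sparse embedding in the tensor-product norm can be computed from the two Gram matrices alone, since the squared norm of $\sum_{i,j} A_{ij} K_{x_i} L_{y_j}$ is $\mathrm{tr}(A^\top K A L)$. The sketch below illustrates this under assumptions: Gaussian kernels, random data, and a crude hard-thresholding of $W$ as a stand-in for a sparsity method (it is not the Lasso procedure of Section 5); all names are hypothetical.

```python
import numpy as np

def gaussian_gram(Z, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (an illustrative kernel choice)."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def tensor_norm_sq(A, K, L):
    """Squared norm of sum_ij A_ij K_{x_i} L_{y_j}, i.e. trace(A^T K A L)."""
    return float(np.trace(A.T @ K @ A @ L))

rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 2))               # hypothetical inputs
Y = rng.normal(size=(n, 3))               # hypothetical outputs
K = gaussian_gram(X)
L = gaussian_gram(Y)

W = rng.normal(size=(n, n))               # stands in for the dense coefficients
M = np.where(np.abs(W) > 1.0, W, 0.0)     # hypothetical sparsification by thresholding

dist_sq = tensor_norm_sq(W - M, K, L)     # squared distance between the two embeddings
# Explicit four-fold sum over the coefficients, used only as a cross-check.
check = float(np.einsum("ij,ik,kl,lj->", W - M, K, W - M, L))
c = float(np.mean(np.diag(K)))            # empirical analogue of c = E_X K(X, X)
```

A small `dist_sq` means the second term of the bound above is small, so the sparse embedding inherits the quality of the dense one.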



